Why I decided to implement infrastructure monitoring

The struggles of running high performance infrastructure

28 January, 2022

Managing many clusters in a distributed environment is not like designing and implementing an efficient CI/CD pipeline with a set of test cases, or a GitOps pipeline that automatically reconciles the state of your repository with the state of your Kubernetes cluster. In my case it was not about Kubernetes at all, but about physical servers running Linux, aggregated together to deliver high performance.
I found it easy to manage and monitor applications, and to add observability around performance and high availability. While I enjoyed the comfort that efficient instrumentation offered, I started running into much bigger issues: every day there were CPU failures, kernel crashes and more, and the worst part was that diving into the logs was even more painful than simply trying to reason my way to a solution.
Still, things worked, and with time I started feeling like the Linux guru I wanted to be. For a long while it was mostly break and fix. Then came Prometheus. At first I was skeptical about trying it out, but the thought of avoiding so many recovery sessions convinced me to implement Prometheus and build metrics that were displayed in Grafana.
Now I could monitor, view and alert on system metrics that went beyond a threshold, but my servers were still gradually failing due to errors that "maybe I had not thought of alerting on". Logging can be extremely difficult, but good logging practice can help you spot the symptom of a failure before it occurs.
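To make the idea of alerting on a threshold concrete, here is a minimal Python sketch. It is only an illustration, not how Prometheus evaluates its alert rules, and the metric names and limits are made up for the example. It also shows the weakness I ran into: a check like this only catches the metrics you thought to list.

```python
from typing import Dict, List

# Hypothetical thresholds: metric names and limits are illustrative assumptions,
# not values taken from any real setup.
THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 90.0,
    "memory_percent": 85.0,
    "disk_percent": 80.0,
}


def breached(sample: Dict[str, float],
             thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the sorted names of metrics whose value is strictly above its limit.

    Metrics that have no configured threshold are ignored, which is exactly
    how a failure "nobody thought of alerting on" slips through.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in sample and sample[name] > limit
    )


if __name__ == "__main__":
    # A made-up sample of system metrics for one server.
    sample = {"cpu_percent": 97.2, "memory_percent": 61.0, "disk_percent": 80.0}
    for name in breached(sample):
        print(f"ALERT: {name} = {sample[name]} exceeds {THRESHOLDS[name]}")
```

With the sample above, only `cpu_percent` is reported, because `disk_percent` sits exactly on its limit and the comparison is strict.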
ELK was my tool of choice, and after a few weeks of "not worrying about failures", the whole system went down, and this time all the monitoring and logging platforms were offline too. No need to panic: it was just a network failure, with DNS not reachable!

Andrew Espira
Cloud computing engineering, DevOps and SRE. Passionate about Cloud Native Technologies
