Monitoring systems in DevOps

Can you define “big”? Do you know how much is “too much”? Is “little” enough or not? If you cannot answer these questions, then you are not alone! Probably no one can. If you are wondering why, I am going to show you what to do so that you are not lost.

One of the most important factors when you launch your service is the end user’s experience. After all, these people pay for your services. You want their experience to be smooth and enjoyable. You do want them to come back!

Have you ever been angry with a website that took ages to load? Probably yes. So if you provide your customers with web services, you simply do not want them to wait. Research suggests that a typical customer will wait up to three seconds before leaving. That is not a long time…

If you are monitoring your service, you might want to check the aforementioned load time. Usually “all green” means you are able to show the website in less than 3 seconds (additional things might still be downloading in the background, but the user will not notice). If the time is longer (but not much longer!), you should get a warning. You might want to cap the warning range at 15-20 seconds. If loading takes longer than that then, sorry, you are offline. There are exceptions to the rule, obviously. If you want to buy tickets for a music show or a football match together with 50 000 other fans, all at the same time, there is a high probability that you will be queued and will have to wait long minutes or even hours. But this is not the everyday case.
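
The thresholds above can be written down as a small function. This is only a sketch: the name classify_load_time and the status labels are made up for illustration, and the 15-second cut-off is an assumption taken from the lower end of the 15-20 second range.

```python
def classify_load_time(seconds, ok_limit=3.0, offline_limit=15.0):
    """Map a measured page load time (in seconds) to a monitoring status.

    ok_limit and offline_limit are illustrative defaults, not fixed rules;
    tune them to your own service.
    """
    if seconds < ok_limit:
        return "ok"        # "all green": the page shows in under 3 seconds
    if seconds <= offline_limit:
        return "warning"   # slower than we want, but still reachable
    return "offline"       # so slow that users will treat the site as down


print(classify_load_time(1.2))   # ok
print(classify_load_time(7.5))   # warning
print(classify_load_time(42.0))  # offline
```

For a ticket sale with tens of thousands of fans queuing at once, you would pass much larger limits instead of relying on the defaults.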

So how do you start monitoring your service? You might check availability on your own by pressing the “Refresh” button, but you will not be able to do it every few seconds, 24/7/365. That is when monitoring teams and systems come into play.
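
To give a feel for what such a system does instead of a person pressing “Refresh”, here is a minimal sketch of an automated check built only on the Python standard library. The function names, the 30-second interval and the 20-second timeout are assumptions made for illustration; a real monitoring product does far more (alerting, history, dashboards).

```python
import time
import urllib.request


def measure_load_time(url, timeout=20.0, opener=urllib.request.urlopen):
    """Return the seconds needed to download url, or None if it failed.

    opener is replaceable so the function can be tried without a network.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            response.read()  # download the whole body, like a browser would
    except OSError:
        # Connection errors, HTTP errors and timeouts are all OSError subclasses.
        return None
    return time.monotonic() - start


def poll(url, interval=30.0, checks=3, sleep=time.sleep, measure=measure_load_time):
    """Measure url a fixed number of times, pausing interval seconds between checks."""
    results = []
    for i in range(checks):
        results.append(measure(url))
        if i < checks - 1:
            sleep(interval)
    return results
```

Calling poll("https://example.com", interval=30.0, checks=3) would return three measurements in seconds, with None marking a failed check. A production setup would run such checks continuously and from more than one location.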

First of all, you need to quantify your needs. You need to answer some basic questions, such as “How long can my customer wait for the website to load?”. By answering such questions you specify your KPIs (Key Performance Indicators). Once the indicators are stated, you are ready to go.

The whole monitoring system should be set up in such a way that it can identify an incident when it occurs. It is good to define an incident as a moment when the end user sees that something bad is happening, e.g., the web store is not working and they get a blank or error page. Whenever possible, the system should try to repair itself. Thanks to such a setup, the issue will not be visible for long. On the other hand, if the automated procedure is not able to bring the service back to life, then this is when a human comes into play and analyses the problem. You can choose whether to start with the first line of support, staffed with, e.g., junior DevOps (or Network Operations Center Monitoring) Engineers, or to transfer the problem directly to more senior personnel. It is good practice to start with juniors, since this is a great opportunity for them to learn and gain real-life experience.

If the issue is too challenging for the first line of support, then the second line is called in and they work together to correct the issue. If they are still unable to find a fix, then the third line gets involved; these can be developers. Usually the issue gets corrected by then.
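
The escalation path described above (automated repair first, then the first, second and third line of support) can be sketched as a simple chain. Everything here is hypothetical: the function names, the "severity" field and the rule that each level can fix incidents up to a certain severity are invented purely to make the example runnable. The sketch also simplifies one thing: in practice the second line works together with the first rather than replacing it.

```python
def escalate(incident, support_chain):
    """Offer the incident to each level in order.

    Returns the name of the first level that fixes it, or None if nobody can.
    """
    for name, try_to_fix in support_chain:
        if try_to_fix(incident):
            return name
    return None


# Hypothetical fixers: each level handles incidents up to a given severity.
def automated_repair(incident):
    return incident["severity"] <= 1


def first_line(incident):
    return incident["severity"] <= 2


def second_line(incident):
    return incident["severity"] <= 3


def third_line(incident):
    return incident["severity"] <= 4


SUPPORT_CHAIN = [
    ("automated repair", automated_repair),
    ("first line", first_line),
    ("second line", second_line),
    ("third line", third_line),
]

print(escalate({"severity": 1}, SUPPORT_CHAIN))  # automated repair
print(escalate({"severity": 3}, SUPPORT_CHAIN))  # second line
```

Keeping the chain as plain data makes it easy to skip the first line and hand an incident straight to more senior personnel when you prefer that.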

After resolving an incident, it is good practice to perform a small investigation in order to answer the main questions: What went wrong? and How can we prevent it from happening again? These questions are rarely answered by one person. More often it is a whole team (or teams!) that finds the best solutions and implements them. But this is a story for a different post.

Figure 1 AT&T's Global Network Operations Center. Source: https://www.turnerconstruction.com/experience/project/653E/att-network-operations-center
Remigiusz Pospieszyński