I started my professional career in IT operation at a large automotive company, where I supported the company’s brand websites. There I learned the importance of a good monitoring system, which supports IT operations in detecting problems early and accurately. And I also learned that even enterprise IT monitoring systems are fed best with a dead-simple HTML page containing the string “OK”. Or some other string, which then means “Oh, something’s wrong!”.
In this post I want to give you some impression about the problematics of application monitoring, especially with Sling health checks in mind. Because it isn’t as simple as it sounds in the first place, and you can do things wrong. But every application should posses the ability to deliver some information about its current status, as the cockpit in your car gives you information about the available gas (or electricity) in your system.
The problem of async error reporting
Consolidating information
When you have many individual data, but you need to build a single data point about for a certain timeframe (say 10 seconds), you need to come up with a strategy to consolidate them. A common approach is to collect all individual results (e.g. just single “ok” or “not ok” information) and adding them to a list. When the health check status needs to be calculated, this list is iterated and the number of “OKs” and “not oks” is counted, the ratio is calculated and reported; and after that the list is cleaned, and the process starts again.
When you design such consolidation algorithms, you should always keep in mind how errors are reported. In the above mentioned case, 10 seconds full of errors would be reported only for a single reporting cycle as CRITICAL. The cycle before and after could be OK again. Or if you have larger cycles (e.g. 5 minutes for your Nagios) think how 10 seconds of errors are being reported, while in the remaining 4’50’’ you don’t have no problem at all. Should it reported with the same result as you have the same number of errors spread over this 5 minutes? How should this case be handled if you have decided to ignore an average rate of 2% failing transactions?