Sling healthchecks – what to monitor

The ability to know what’s going on inside an application is a major asset whenever you need to operate it. As IT operations you don’t need (or want) to know the details, but you need to know whether the application works “ok” by some definition. Typically IT operations deploys a monitoring setup alongside the application which allows them to state whether the application is ok or not.

In the AEM world such statements are typically made via Sling healthchecks. As a developer you can write assertions against the correct and expected behaviour of your application and expose the result. While this technical aspect is understood quite easily, the more important question is: “What should I monitor?”

In past years and projects we typically used healthchecks both for deployment validation (if the application reports “ok” after a deployment, we continue with the next instance) and for loadbalancer checks (if the application is “ok”, traffic is routed to this instance). This results in these requirements:

  • The status must not fluctuate, but rather be stable. This normally means that temporary problems (which might affect only a single request) must not influence the result of the healthcheck. But if the error rate exceeds a certain threshold, it should be reported (via healthchecks).
  • The healthcheck execution time shouldn’t be excessive. I would not recommend performing JCR queries or other repository operations in a healthcheck, but rather consuming data which is already there, to keep the execution time low.
  • A lot of the infrastructure of an AEM instance can be monitored implicitly. If you expose your healthcheck results via a page (/status.html) and this page returns a status “200 OK”, then you know that the JVM process is running, the repository is up, and that Sling script resolution and execution are working properly.
  • When you want your loadbalancer to determine the health status of your application from the results of Sling healthchecks, you must be quite sure that every single healthcheck involved works correctly, and that an “ERROR” in a healthcheck really means that the application itself is not working and cannot respond to requests properly. In other words: if this healthcheck goes to ERROR on all publishers at the same time, your application is no longer reachable.

Ok, but what functionality should you monitor via healthchecks? Which parts of your application should you monitor? Here are some recommendations.

  1. Monitor prerequisites. If your application constantly needs connectivity to a backend system and does not work properly without it, implement a healthcheck for it. If the configuration for accessing this system is missing or even the initial connection fails, let your healthcheck report “ERROR”, because then the system cannot be used (see the sketch after this list).
  2. Monitor individual errors. For example, if such a backend connection throws an error, report it as WARN (remember, you depend on it!). Also implement an error threshold, and if this threshold is reached, report ERROR.
  3. Implement a healthcheck for every stateful OSGi service which knows about “successful” or “failed” operations.
  4. Try to avoid the case that a single error is reported via multiple healthchecks. On the other hand, try to be as specific as possible when reporting. So instead of “more than 2% of all requests were answered with an HTTP status code 5xx” via healthcheck 1, you should report “connection refused from LDAP server” in healthcheck 2. In many cases fundamental problems will trigger a lot of different symptoms (and therefore cause many healthchecks to fail), and it is very hard to change this behaviour. In that case you need to document explicitly how to react in such situations and how to find the relevant healthcheck quickly.
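
To make recommendation 1 a bit more concrete, here is a minimal sketch of such a prerequisite check. It assumes the org.apache.sling.hc.api interfaces from the Sling Health Check 1.x modules and OSGi Declarative Services annotations; the BackendConnectionService with its isConfigured()/isConnected() methods is a hypothetical placeholder for your own backend client.

    import org.apache.sling.hc.api.HealthCheck;
    import org.apache.sling.hc.api.Result;
    import org.apache.sling.hc.util.FormattingResultLog;
    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Reference;

    // Hypothetical service interface wrapping your backend client.
    interface BackendConnectionService {
        boolean isConfigured();
        boolean isConnected();
    }

    // Reports CRITICAL when the backend prerequisite is missing,
    // because the application cannot work without it.
    @Component(service = HealthCheck.class, property = {
            HealthCheck.NAME + "=Backend connectivity",
            HealthCheck.TAGS + "=backend,prerequisites"
    })
    public class BackendConnectivityHealthCheck implements HealthCheck {

        @Reference
        private BackendConnectionService backend;

        @Override
        public Result execute() {
            FormattingResultLog log = new FormattingResultLog();
            if (!backend.isConfigured()) {
                log.critical("No backend configuration found");
            } else if (!backend.isConnected()) {
                log.critical("Backend is configured, but the initial connection failed");
            } else {
                log.info("Backend connection is available");
            }
            return new Result(log);
        }
    }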

Regarding the reporting itself, you can either report every problem/exception/failure or work with a threshold.

  • Report every single problem if the operation runs rarely. If you have a scheduled task with a daily interval, the healthcheck should report immediately if it fails, and report “ok” again once it works. The time between runs should give you enough time to react (a sketch follows after this list).
  • If your operation runs very often (e.g. as part of every request), implement a threshold and report a warning/error only if this threshold is currently exceeded. But try to avoid constantly switching between “ok” and “not ok”.
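
For the “rare operation” case, a minimal sketch (same assumptions about the Sling Health Check API as above): the scheduled job stores its last outcome in a small shared holder, and the health check simply reports that stored state, so a failed nightly run stays visible until the next successful one. LastRunStatus and the “Nightly import” job are hypothetical names.

    import java.util.concurrent.atomic.AtomicReference;

    import org.apache.sling.hc.api.HealthCheck;
    import org.apache.sling.hc.api.Result;
    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Reference;

    // Hypothetical holder updated by the scheduled job after every run:
    // reportSuccess() clears the error, reportFailure(msg) stores it.
    // (Sketched as a plain class here; register it as an OSGi service in practice.)
    class LastRunStatus {
        private final AtomicReference<String> lastError = new AtomicReference<>();
        void reportSuccess() { lastError.set(null); }
        void reportFailure(String message) { lastError.set(message); }
        String getLastError() { return lastError.get(); }
    }

    @Component(service = HealthCheck.class, property = { HealthCheck.NAME + "=Nightly import" })
    public class NightlyImportHealthCheck implements HealthCheck {

        // The scheduled job and this check share the same LastRunStatus instance.
        @Reference
        private LastRunStatus lastRunStatus;

        @Override
        public Result execute() {
            String error = lastRunStatus.getLastError();
            if (error == null) {
                return new Result(Result.Status.OK, "Last run completed successfully");
            }
            // Stays CRITICAL until the next successful run reports OK again.
            return new Result(Result.Status.CRITICAL, "Last run failed: " + error);
        }
    }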

I hope this article gives you some ideas on how to work with healthchecks. I see them as a great tool which is very easy to implement. They are ideal for one-time validations (e.g. as smoke tests after a deployment) and also for continuous observation of an instance (by feeding healthcheck data to loadbalancers). Of course they can only observe the things you defined up front and do not replace testing. But they really speed up processes.

When is AEM fully started?

Or in other words: How can I know that the instance is fully working?

A common task when you work with automation is reliably detecting when the AEM instance is up and running. Maybe you reconfigure the loadbalancer to send requests to this instance. Or you just start doing some other work.

The most naive approach is to request an AEM page and act on the HTTP status code. If the status is “200”, you consider the system up and running; if you get any other code, it’s not. Sounds easy, is easy. But not really accurate, because there are times during startup when the system returns a status code 200, but a blank page. Unfortunate.

So, next approach: check if all bundles are active. Request /system/console/bundles.json and parse it, looking for a statement like this:

"status":"Bundle information: 447 bundles in total - all 447 bundles active."

Nice try, but it does not work: all bundles being up does not guarantee that all the services are up as well.

The third approach is more complicated and requires coding, but delivers good results: build a healthcheck which depends on a number of other services (the ones you consider important). If this healthcheck is present and reports OK, it means that all the services it depends on are active as well (the default semantics of the @Reference annotation guarantee that). This does not necessarily mean that the startup is finished, but only that the services you considered relevant are up.
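
A minimal sketch of such a composite check, again assuming the Sling Health Check 1.x API and OSGi Declarative Services annotations; the two referenced services are just placeholders for whatever you consider relevant. Since @Reference dependencies are mandatory by default, this component only becomes active (and the check only becomes available) once every referenced service is up.

    import org.apache.sling.api.resource.ResourceResolverFactory;
    import org.apache.sling.hc.api.HealthCheck;
    import org.apache.sling.hc.api.Result;
    import org.apache.sling.settings.SlingSettingsService;
    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Reference;

    // This component only activates when all mandatory references can be satisfied,
    // so "the check exists and returns OK" implies "these services are up".
    @Component(service = HealthCheck.class, property = {
            HealthCheck.NAME + "=Relevant services started",
            HealthCheck.TAGS + "=startup"
    })
    public class StartupHealthCheck implements HealthCheck {

        // Placeholder references; replace them with the services that matter to you.
        @Reference
        private ResourceResolverFactory resolverFactory;

        @Reference
        private SlingSettingsService slingSettings;

        @Override
        public Result execute() {
            // If this code runs at all, every mandatory reference is bound.
            return new Result(Result.Status.OK, "All referenced services are active");
        }
    }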

And finally there is a fourth approach, which has been built specifically for this case: the startup listeners. It’s a service interface you can implement, and you get notified when the system is up. That’s it. The API does not give any guarantee that if the system is up, it is still up 5 minutes later. I am also not 100% sure about the semantics of this approach if a service fails to start, or if a service decides to stop (or starts throwing exceptions).
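
For completeness, a sketch of the listener approach. It assumes the org.apache.sling.launchpad.api.StartupListener interface; the exact interface and method signatures may differ between Sling/AEM versions, so treat this purely as an illustration.

    import org.apache.sling.launchpad.api.StartupListener;
    import org.apache.sling.launchpad.api.StartupMode;
    import org.osgi.service.component.annotations.Component;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Gets notified once when the framework considers startup finished;
    // it says nothing about the health of the instance afterwards.
    @Component(service = StartupListener.class)
    public class LogStartupListener implements StartupListener {

        private static final Logger LOG = LoggerFactory.getLogger(LogStartupListener.class);

        @Override
        public void inform(StartupMode mode, boolean finished) {
            if (finished) {
                LOG.info("Startup already finished (mode: {})", mode);
            }
        }

        @Override
        public void startupFinished(StartupMode mode) {
            LOG.info("Startup finished (mode: {})", mode);
        }

        @Override
        public void startupProgress(float ratio) {
            // Progress callbacks during startup; not needed here.
        }
    }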

The healthcheck is my personal favorite. It can be used not only to give you information about a single event (“the system is up”), but it can take many more factors into account to decide whether the system is up, and these factors can be checked constantly. When a service is no longer available, the healthcheck goes to ERROR (“red”), and when it’s available again, the healthcheck reports OK again. The approach is more powerful, provides better extensibility, and is quite easy to understand. So I choose a healthcheck every time I need to know about the health state of AEM.

Writing health checks — the problem

I started my professional career in IT operations at a large automotive company, where I supported the company’s brand websites. There I learned the importance of a good monitoring system, which supports IT operations in detecting problems early and accurately. And I also learned that even enterprise IT monitoring systems are best fed with a dead-simple HTML page containing the string “OK”. Or some other string, which then means: “Oh, something’s wrong!”

In this post I want to give you an impression of the problems of application monitoring, especially with Sling health checks in mind. Because it isn’t as simple as it sounds at first, and you can do things wrong. But every application should possess the ability to deliver some information about its current status, just as the cockpit in your car gives you information about the available gas (or electricity) in your system.

The problem of async error reporting

The health check is executed when its report is requested, so you cannot just push your error information to the health check as you log errors during processing. Instead you have to write them to a queue (or any other data structure), where this information is stored until it is consumed. The situation is different for periodic daily jobs, where only one result is produced per day.
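
A minimal sketch of such a shared data structure: here it is simply a pair of counters that the request-processing code updates and that a health check consumes later. The class and method names are illustrative, not part of any API.

    import java.util.concurrent.atomic.AtomicLong;

    import org.osgi.service.component.annotations.Component;

    // Collects results as they happen; a health check reads (and resets) them later.
    @Component(service = OperationResultCollector.class)
    public class OperationResultCollector {

        private final AtomicLong total = new AtomicLong();
        private final AtomicLong failed = new AtomicLong();

        // Called from the request-processing code for every operation.
        public void reportSuccess() {
            total.incrementAndGet();
        }

        public void reportFailure() {
            total.incrementAndGet();
            failed.incrementAndGet();
        }

        public long getTotal() {
            return total.get();
        }

        public long getFailed() {
            return failed.get();
        }

        // Start a new measurement interval.
        public void reset() {
            total.set(0);
            failed.set(0);
        }
    }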

Consolidating information

When you have many individual data points but need to build a single data point for a certain timeframe (say 10 seconds), you need to come up with a strategy to consolidate them. A common approach is to collect all individual results (e.g. just single “ok” or “not ok” values) and add them to a list. When the health check status needs to be calculated, this list is iterated, the number of “ok” and “not ok” entries is counted, and the ratio is calculated and reported; after that the list is cleared, and the process starts again.
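
Building on the OperationResultCollector sketched in the previous section, the consolidation step could look like this. The 2% threshold and the reset-per-execution strategy are just one possible choice, and the Sling Health Check API assumptions from above apply here as well.

    import org.apache.sling.hc.api.HealthCheck;
    import org.apache.sling.hc.api.Result;
    import org.apache.sling.hc.util.FormattingResultLog;
    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Reference;

    @Component(service = HealthCheck.class, property = { HealthCheck.NAME + "=Operation error rate" })
    public class ErrorRateHealthCheck implements HealthCheck {

        // Report CRITICAL when more than 2% of the operations in the last interval failed.
        private static final double ERROR_THRESHOLD = 0.02;

        @Reference
        private OperationResultCollector collector;

        @Override
        public Result execute() {
            long total = collector.getTotal();
            long failed = collector.getFailed();
            collector.reset(); // start the next measurement interval

            FormattingResultLog log = new FormattingResultLog();
            if (total == 0) {
                log.info("No operations in the last interval");
            } else {
                double rate = (double) failed / total;
                String message = String.format("%d of %d operations failed (%.1f%%)", failed, total, rate * 100);
                if (rate > ERROR_THRESHOLD) {
                    log.critical(message);
                } else {
                    log.info(message);
                }
            }
            return new Result(log);
        }
    }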

When you design such consolidation algorithms, you should always keep in mind how errors are reported. In the case mentioned above, 10 seconds full of errors would be reported as CRITICAL only for a single reporting cycle; the cycles before and after could be OK again. Or if you have larger cycles (e.g. 5 minutes for your Nagios), think about how 10 seconds of errors are reported while in the remaining 4’50’’ you don’t have any problems at all. Should it be reported with the same result as the same number of errors spread over the whole 5 minutes? How should this case be handled if you have decided to ignore an average rate of 2% failing transactions?

You see that you can spend a lot of thinking on it. But be assured: do not try to be too sophisticated. Just take a simple approach and implement it. Then you’re better than 80% of all projects, because you have actually reasoned about this problem and decided to write a health check!

The future of CQ Healthcheck

A few days ago Sling Healthcheck 1.0 was released. My colleague Bertrand Delacretaz initiated and pushed that project, and he did a great job. And now there’s a monitoring solution where it belongs: in the Sling framework, at Apache, with a much greater visibility than my personal pet project would ever get. I don’t have the possibility to spend much time on it, and in fact I never wanted to run such a thing on my own. Founding the CQ healthcheck project was necessary to push the topic “monitoring” and to make it visible. And now I am glad that Bertrand picked it up. I fully trust him to push it on the Sling side and also inside Adobe, so we can expect to see the healthcheck functionality in the next CQ release. And that’s exactly what I always wanted to achieve: a usable and extensible monitoring solution available out of the box in CQ.

So I think that the mission of the CQ healthcheck project is over, and I will discontinue the development on GitHub. I will leave the code there, and you can still fork it and restart the development.

CQ5 healthcheck: backport for CQ 5.4

I learned that there are quite a number of projects out there which are (still?) bound to CQ 5.4 and cannot move to a newer version right now. For these I created a backport of the healthcheck version 1.0, which works reasonably well on my personal instance of CQ 5.4. You can find the code on GitHub in the release-1.0-cq54 branch, but I don’t provide a compiled binary version.

The main changes to the master branch:

  • I backported the 1.0 branch, not master. Currently the differences aren’t that big, so you can maintain a branch “master-cq54” on your own.
  • Adjusted the pom files; this required no code changes.
  • The PropertiesUtil class is not there, but you can replace it 1:1 with the OsgiUtil class available in CQ 5.4.
  • Use “sling:OsgiConfig” nodes instead of nt:file nodes with the extension “.config” (the latter are only available on CQ 5.5 and later).
  • CQ 5.4 does not support sub-folders within the config folder, so you need to put all config nodes directly into it.

And of course the biggest limitation:

  • For replication there is no out-of-the-box JMX support, therefore I dropped the respective config nodes.
  • If you want to contribute support for this feature, you’re welcome 🙂

So have fun with it.

CQ5 healthcheck — how to use

The recent announcement of my healthcheck project caused some buzz, mostly related to how it can be used. So I want to show you how you can leverage the framework for yourself.

The statuspage

First, the package already contains a status page (reachable via <host>/content/statuspage.html), which looks like this:

[Screenshot: CQ5 healthcheck status page]

The first relevant piece of information is the “Overall Status”: it can be “OK”, “WARN” or “CRITICAL”.

This information is computed from all the individual checks listed in the details table, according to this ruleset:

  • If at least 1 check returns CRITICAL, the overall status is “CRITICAL”.
  • If at least 1 check returns WARN and no check returns CRITICAL, the overall status is “WARN”.
  • If all checks return OK, the overall status is “OK”.

The overall status is easily parseable on the statuspage by a monitoring system.

The individual checks are listed by name, status and an optional message. This list should be used to determine which check failed and caused the overall status to deviate from OK.

The status in detail:

  • OK: obvious, isn’t it?
  • WARN: the checked status is not “OK”, but also not “CRITICAL”. The system is still usable, but you need to observe it more closely or perform some actions so the situation doesn’t get worse.
  • CRITICAL: the system should not be used and user experience will be impacted. Action is required.

Managing the loadbalancers

Any loadbalancer in front of CQ5 instances should also be aware of the status of the instances. But loadbalancers probe much more often (about every 30 seconds), and they don’t have that many capabilities to parse complex data. For this use case there is the “/bin/loadbalancer” servlet, which returns only “OK” with a status code 200, or “WARN” with a status code 500. WARN covers both WARN and CRITICAL; in both cases it is assumed that the loadbalancer should not send requests to that instance.
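
If you want to test the servlet outside of a real loadbalancer, a trivial probe only needs to look at the HTTP status code. A sketch, assuming the instance runs on localhost:4502:

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Trivial standalone probe against the loadbalancer servlet; only the HTTP
    // status code matters (200 = healthy, anything else = take out of rotation).
    // Host and port are assumptions, adjust them for your instance.
    public class LoadbalancerProbe {

        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:4502/bin/loadbalancer");
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            int status = connection.getResponseCode();
            System.out.println(status == 200
                    ? "Instance is healthy, keep it in rotation"
                    : "Instance reports status " + status + ", take it out of rotation");
        }
    }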

That’s it for now. If you have feedback, mail me or just create an issue on GitHub.

CQ5 healthcheck version 1.0

My colleague Alex already disclosed it in early December, but it was not ready yet. In the meantime I have come to think that it’s actually time to release it.

So here it is: the CQ5 health check. A small and easy to understand framework to monitor the status of your CQ5 instance. Its main features are:

  • Usable out of the box
  • All MBeans can be monitored just by configuration
  • Extendable by simple OSGi services
  • Features an extended status page as well as a machine interface for automatic monitoring

And the best part: the source is freely available on GitHub, a package ready for installation is available on Package Share, and the installation is very easy:

  1. Download and install the package from Package Share
  2. Go to http://<your-instance>:4502/content/statuspage.html
  3. Enjoy

So feel free to download it, install it, fork it, extend it. The code is licensed under the Apache License, so you don’t have to disclose your extensions and modifications at all. But I’d love to get contributions back 🙂

Currently the most useful information is stored in the README file, but I hope that I can move it over to the project wiki. This is just the announcement; I plan to add some posts to this blog about how you can write your own health checks (which isn’t hard, by the way).

Enjoy your new toy, and I’d love to get feedback from you, either here on the blog, via Twitter (@joerghoh) or in geek-style via pull requests.

And many thanks to Alex Saar and Markus Haack for their support and contributions.