Sling healthchecks – what to monitor

The ability to know what’s going on inside an application is a major asset whenever you need to operate it. IT operations doesn’t need (or want) to know the details, but it needs to know whether the application works “ok” by some agreed definition. Typically, IT operations deploys a monitoring setup alongside the application which allows exactly this statement: is the application ok or not?

In the AEM world such statements are typically made via Sling healthchecks. As a developer you can write assertions against the correct and expected behaviour of your application and expose the result. While this technical aspect is easy to understand, the more important question is: “What should I monitor?”
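To make the technical side concrete, here is a minimal sketch of such a healthcheck. It assumes the Felix Health Check API (org.apache.felix.hc.api) that ships with recent AEM versions; on older instances the same pattern works with the org.apache.sling.hc.api classes. The class name, the NAME/TAGS values and the checkSomething() assertion are placeholders, not part of any real project.

```java
import org.apache.felix.hc.api.HealthCheck;
import org.apache.felix.hc.api.Result;
import org.osgi.service.component.annotations.Component;

// Registers as an OSGi service; the healthcheck framework picks it up,
// executes it and aggregates the returned Result.
@Component(service = HealthCheck.class, property = {
        HealthCheck.NAME + "=My application check",
        HealthCheck.TAGS + "=myapp"
})
public class MyApplicationHealthCheck implements HealthCheck {

    @Override
    public Result execute() {
        boolean applicationOk = checkSomething();
        return applicationOk
                ? new Result(Result.Status.OK, "Application behaves as expected")
                : new Result(Result.Status.CRITICAL, "Application check failed");
    }

    private boolean checkSomething() {
        // placeholder for your own assertion against the expected behaviour
        return true;
    }
}
```

The NAME and TAGS service properties are what the framework uses to identify and group checks, so pick them with the later consumers (deployment scripts, loadbalancer checks) in mind.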

Over the last years and projects we have typically used healthchecks for both deployment validation (if the application reports “ok” after a deployment, we continue with the next instance) and loadbalancer checks (if the application is “ok”, traffic is routed to this instance). This results in the following requirements:

  • The status must not fluctuate, but rather be stable. This normally means that temporary problems (which might affect only a single request) must not influence the result of the healthcheck. But if the error rate exceeds a certain threshold, it should be reported via the healthcheck (see the sketch after this list).
  • The healthcheck execution time shouldn’t be excessive. I would not recommend performing JCR queries or other repository operations in a healthcheck, but rather consuming data which is already there, to keep the execution time low.
  • A lot of the infrastructure of an AEM instance can be monitored implicitly. If you expose your healthcheck results via a page (/status.html) and this page returns a status “200 OK”, then you know that the JVM process is running, the repository is up, and that Sling script resolution and execution are working properly.
  • When you want your loadbalancer to determine the health status of your application from the results of Sling healthchecks, you must be quite sure that every single healthcheck involved works the right way, and that an “ERROR” in the healthcheck really means that the application itself is not working and cannot respond to requests properly. In other words: if this healthcheck goes to ERROR on all publish instances at the same time, your application is no longer reachable.
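As an illustration of the first two points (stable status, cheap execution), here is a sketch of a check that only reads counters the application already collects. The BackendStatistics interface and the 2%/10% thresholds are made-up examples; the FormattingResultLog and Result classes are again from the Felix Health Check API.

```java
import org.apache.felix.hc.api.FormattingResultLog;
import org.apache.felix.hc.api.HealthCheck;
import org.apache.felix.hc.api.Result;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

// Hypothetical service in which the application already tracks its request counters.
interface BackendStatistics {
    long getTotalRequests();
    long getFailedRequests();
}

// The check only reads existing counters, so it is cheap to execute and its
// status does not flip because of a single failed request.
@Component(service = HealthCheck.class, property = {
        HealthCheck.NAME + "=Backend error rate",
        HealthCheck.TAGS + "=myapp"
})
public class BackendErrorRateHealthCheck implements HealthCheck {

    private static final double WARN_THRESHOLD = 0.02;   // 2% failed requests
    private static final double ERROR_THRESHOLD = 0.10;  // 10% failed requests

    @Reference
    private BackendStatistics stats;

    @Override
    public Result execute() {
        FormattingResultLog log = new FormattingResultLog();
        long total = stats.getTotalRequests();
        long failed = stats.getFailedRequests();
        double errorRate = (total == 0) ? 0.0 : (double) failed / total;

        if (errorRate >= ERROR_THRESHOLD) {
            log.critical("{} of {} backend requests failed", failed, total);
        } else if (errorRate >= WARN_THRESHOLD) {
            log.warn("{} of {} backend requests failed", failed, total);
        } else {
            log.info("{} of {} backend requests failed", failed, total);
        }
        return new Result(log);
    }
}
```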

Ok, but what functionality and which parts of your application should you monitor via healthchecks? Here are some recommendations.

  1. Monitor prerequisites. If your application constantly needs connectivity to a backend system and does not work properly without it, implement a healthcheck for it. If the configuration for accessing this system is missing or even the initial connection fails, let your healthcheck report “ERROR”, because then the system cannot be used (see the sketch after this list).
  2. Monitor individual errors: for example, if such a backend connection throws an error, report it as WARN (remember, you depend on it!). In addition, implement an error threshold, and if this threshold is reached, report ERROR.
  3. Implement a healthcheck for every stateful OSGi service which knows about “successful” or “failed” operations.
  4. Try to avoid the case that a single error is reported via multiple healthchecks. On the other hand, try to be as specific as possible when reporting. So instead of reporting “more than 2% of all requests were answered with an HTTP status code 5xx” via healthcheck 1, you should report “connection refused from LDAP server” in healthcheck 2. In many cases fundamental problems will trigger a lot of different symptoms (and therefore cause many healthchecks to fail), and it is very hard to change this behaviour. In that case you need to document explicitly how to react in such situations and how to find the relevant healthcheck quickly.
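A sketch for the first recommendation could look like the following. The hypothetical BackendConnector service is assumed to register itself only once it is configured, so a missing configuration shows up as a null reference and is reported as CRITICAL (the API’s equivalent of “ERROR”), while a connector that is present but currently reporting a problem only produces a WARN.

```java
import org.apache.felix.hc.api.HealthCheck;
import org.apache.felix.hc.api.Result;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import org.osgi.service.component.annotations.ReferenceCardinality;
import org.osgi.service.component.annotations.ReferencePolicy;
import org.osgi.service.component.annotations.ReferencePolicyOption;

// Hypothetical connector service that only registers itself once it is configured.
interface BackendConnector {
    boolean isConnected();
}

@Component(service = HealthCheck.class, property = {
        HealthCheck.NAME + "=Backend prerequisites",
        HealthCheck.TAGS + "=myapp"
})
public class BackendPrerequisiteHealthCheck implements HealthCheck {

    // optional dynamic reference: if the connector is not configured, the field stays null
    @Reference(cardinality = ReferenceCardinality.OPTIONAL,
               policy = ReferencePolicy.DYNAMIC,
               policyOption = ReferencePolicyOption.GREEDY)
    private volatile BackendConnector connector;

    @Override
    public Result execute() {
        if (connector == null) {
            return new Result(Result.Status.CRITICAL,
                    "Backend connector is not configured, the application cannot work");
        }
        if (!connector.isConnected()) {
            // cheap status flag maintained by the connector, no live connection attempt here
            return new Result(Result.Status.WARN, "Backend connector reports no active connection");
        }
        return new Result(Result.Status.OK, "Backend connector is configured and connected");
    }
}
```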

Regarding the reporting itself, you can either report every single problem/exception/failure or work with a threshold:

  • Report every single problem if the operation runs rarely. If you have a scheduled task with a daily interval, the healthcheck should report immediately if it fails, and report “ok” again once it works. The time between runs should give you enough time to react (see the sketch after this list).
  • If your operation runs very often (e.g. as part of every request), implement a threshold and report a warning/error only while this threshold is exceeded. But try to avoid constantly switching between “ok” and “not ok”.
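For the rarely-running case, one simple pattern is to let the scheduled job store the outcome of its last run and have the healthcheck just return that stored result. Everything below (the DailyImportStatus interface and the check name) is hypothetical; only the HealthCheck and Result classes come from the Felix Health Check API.

```java
import java.util.concurrent.atomic.AtomicReference;

import org.apache.felix.hc.api.HealthCheck;
import org.apache.felix.hc.api.Result;
import org.osgi.service.component.annotations.Component;

// Hypothetical callback interface the daily job uses to report its outcome.
interface DailyImportStatus {
    void reportRun(boolean success, String message);
}

// Registered both as a HealthCheck and as the status service the job reports to.
@Component(service = { HealthCheck.class, DailyImportStatus.class }, property = {
        HealthCheck.NAME + "=Daily import",
        HealthCheck.TAGS + "=myapp"
})
public class DailyImportHealthCheck implements HealthCheck, DailyImportStatus {

    private final AtomicReference<Result> lastResult =
            new AtomicReference<>(new Result(Result.Status.OK, "No run recorded yet"));

    // called by the scheduled job after each run
    @Override
    public void reportRun(boolean success, String message) {
        lastResult.set(new Result(success ? Result.Status.OK : Result.Status.CRITICAL, message));
    }

    // the healthcheck itself just returns the stored result of the last run
    @Override
    public Result execute() {
        return lastResult.get();
    }
}
```

This way a failure of the daily job shows up immediately and clears again with the next successful run, without the healthcheck itself doing any expensive work.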

I hope that this article gives you some idea of how to work with healthchecks. I see them as a great tool which is very easy to implement. They are ideal for one-time validations (e.g. as smoketests after a deployment) and also for continuous observation of an instance (by feeding healthcheck data to loadbalancers). Of course they can only observe the things you defined up front and do not replace testing. But they really speed up processes.