How to use Runmodes correctly (update)

Runmodes are an essential concept within AEM; they are the main and only way to assign roles to AEM instances. The primary use case is to distinguish between the author and publish role; another common use case is to split between production, staging and development environments. Technically a runmode is just a set of strings assigned to an instance, which the Sling framework uses on a few occasions, the most prominent being the Sling JCR Installer (which handles the /apps/myapp/config, /apps/myapp/config.author, etc. directories).

But I see other use cases as well: use cases where the runmodes are fetched and compared against hardcoded strings. A typical example:

boolean isAuthor() {
    return slingSettingsService.getRunModes().contains("author");
}

From a technical point of view this is fully correct, and works as expected. The problem arises when some code is based on the result of this method:

if (isAuthor()) {
    // do something
}

Because now the execution of this code is hardcoded to the author environment, which can become problematic if this code must not be executed on the DEV authoring instances (e.g. because it sends email notifications). It is not a problem to change this to:

if (isAuthor() && !isDevelopmentEnvironment()) {
    // do something
}

But now it is hardcoded again 😦

The better way is to rely solely on the OSGi framework. Just make your OSGi component require a configuration, and define that configuration only for the runmodes where the component should be active.

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.ConfigurationPolicy;

@Component(configurationPolicy = ConfigurationPolicy.REQUIRE)
public class MyServiceImpl implements MyService {
    // ...

This case requires NO CODING at all; instead you just use the functionality provided by Sling. And the component does not even activate if the configuration is not present!
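For illustration, this is how such a per-runmode configuration could be placed in the repository (the paths and the PID are made up; older Sling JCR Installer versions expect sling:OsgiConfig nodes or .config files instead of .cfg.json):

/apps/myapp/config.author/com.example.myapp.impl.MyServiceImpl.cfg.json     (picked up on author instances only)
/apps/myapp/config.publish/com.example.myapp.impl.MyServiceImpl.cfg.json    (picked up on publish instances only)
/apps/myapp/config.author.dev/com.example.myapp.impl.MyServiceImpl.cfg.json (author instances with the dev runmode)

With ConfigurationPolicy.REQUIRE the component activates only on instances whose runmodes match one of these config folders; no runmode check in code is needed.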

Long story short: whenever you see a reference to SlingSettingsService.getRunModes(), it is very likely used wrongly. And we can generalize it to: “If you add a reference to the SlingSettingsService, you are doing something wrong.”

There are only very few cases where the information provided by this service is actually useful for non-framework purposes. But I bet you are not writing a framework 🙂

Update (Oct 17, 2019): In a Twitter discussion Ahmed Musallam and Justin Edelson pointed out that there are use cases where this is actually useful and the right API to use. Possibly yes, I cannot argue about that; but these are the few cases I mentioned above, and I have never encountered them personally. As a general rule of thumb it is still applicable. Every rule has its exceptions.

You think that I have written on this topic already? Yes, I did, actually twice already (here and here). But it seems that only repetition helps to get this message through. I still find this pattern in too many codebases.

Java 7 support for CQ5?

Now that Java 7 has been released, the question arises pretty soon: “Does Adobe support Java 7 as runtime for CQ5?”

The clear answer is: no, it isn’t supported, mostly because of some issues which can cause corruption in the Lucene index. Of course you can give it a try and take the risk yourself (just as you can run CQ5 on a Windows 7 box or on Debian Linux); but don’t complain if you observe some strange behaviour.

PS: Just adding -XX:-UseLoopPredicate to your JVM parameters won’t solve the problem (according to the Lucene website).

CQ5 logging

This week I held a workshop at a customer, and someone asked me: “How do other customers of Day handle their logfiles? Do they check and analyze them?” I had to admit that, according to my experience, nobody really cares about them; the only situation in which they care is when the disk is full of them. Yeah, a sad truth.

But this brings us to today’s topic: logfiles and keeping track of them. CQ5 is by default pretty noisy; if you check the file crx-quickstart/logs/error.log after some requests have been made, you see a lot of messages of loglevel “INFO”. Sometimes quite interesting, but in the end they pollute the log, and the really important messages vanish in the sheer mass of this noise. So, at least for production systems, the loglevel should be increased to “WARN” or even “ERROR”, so that only messages at level WARN or ERROR are logged and INFO is suppressed.

So, how can this be achieved? Sling, which forms the WCM part of CQ5, brings its own logging; it can be configured using the Felix console and is well documented on the Day documentation site. CRX (at least up to CRX 2.1) has its own logging mechanism (log4j), which can be reconfigured in the crx-quickstart/server/runtime/0/_crx/WEB-INF/log4j.xml file.
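For the Sling side, the loglevel can also be provisioned as an OSGi configuration instead of clicking through the Felix console. A minimal sketch using the standard Sling logging factory PID (file path and logger names are just examples):

org.apache.sling.commons.log.LogManager.factory.config-myapp.config:
    org.apache.sling.commons.log.level="warn"
    org.apache.sling.commons.log.file="logs/error.log"
    org.apache.sling.commons.log.names=["com.day","org.apache.sling"]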

And, on top of this all, we have on a standard Unix system

  • crx-quickstart/logs/stderr.log and crx-quickstart/logs/stdout.log
  • crx-quickstart/logs/server.log
  • crx-quickstart/server/logs/startup.log

Neat, isn’t it? OK, how can you configure them?

Short answer: you can’t. At least it isn’t documented.

The stdout.log and stderr.log are the standard output and standard error channels of the Java process, which are redirected to these files. Especially stdout.log fills up pretty fast, because CRX also logs all its messages to stdout. So fixing up the log4j.xml file is mandatory; we don’t need this information twice, in crx/error.log and in stdout.log. Oh, and of course these files aren’t rotated; new data is only appended. So they grow and grow and grow.

The server.log file is written by the CQSE servlet engine and cleared when the servlet engine is started. The same holds for startup.log, which contains the output of the serverctl script before the Java process is started, and also error messages if the Java process doesn’t start at all (most often due to invalid parameters).

A few recommendations (just a personal point of view):

  • Log rotation should be performed on a time basis and not based on the size of the logfile. You should have enough disk space then, and monitor it closely, of course. But this helps you to look up a certain problem (“Wait, it was yesterday, so it must be in the error.log.0 file”) without hassle.
  • Implement your own logfile rotation for the stdout.log and stderr.log files; a sketch follows below the log4j snippet. I filed a bug for it too, but until then you need to help yourself. Sorry.
  • Increase loglevel to WARN. INFO just logs too much noise.
  • Adjust the log4j.xml of CRX and change it to something like this:
<root>
  <level value="warn" />
  <appender-ref ref="error"/>
</root>
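And for the stdout.log/stderr.log rotation mentioned above, a standard logrotate entry can serve as a stopgap. A minimal sketch, assuming the instance lives in /opt/cq5 (the path is hypothetical; copytruncate is needed because the JVM keeps the files open):

# /etc/logrotate.d/cq5
/opt/cq5/crx-quickstart/logs/stdout.log /opt/cq5/crx-quickstart/logs/stderr.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}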

Adjusting the logging according to your needs shows that you care about the logs and know that they are useful. Which is a prerequisite for doing some analysis on them. But that topic is a candidate for one of the next postings.

Application monitoring vs System monitoring

Recently I was asked how a CQ monitoring should be set up. The question was accompanied by a very short description of how the monitoring was supposed to look. There were some points like “monitoring the publishing by requesting pages”, “check if the Java process is running” and “checking the free disk space”. Obviously they had just set up some new servers for this environment and thought that they needed to monitor some parameters.

As a first step I advised them to separate the topics “application monitoring” and “system monitoring”. One might wonder why I suggest making a strong division between these topics, so here is the background.

Standardization is one of the key topics in IT; everything that is standardized can be reused, can be exchanged for a compatible product, and finally lowers cost. So IT operation teams tend to standardize as much as they can, because, as an intermediate step towards lower costs, standardization allows automation.

Basic system monitoring is such a thing. Every computer has components which can be monitored in this way: disk health, CPU temperature, status of the power supply units, internal temperature. But also CPU utilization, free disk space, network connectivity, or whether the system starts to swap. And many more. These are basic metrics which can be measured and monitored in a consistent and automatic way.

For these points it doesn’t matter whether the system runs a data warehouse application, a mailserver or CQ. They are all the same, and the reaction is comparable if one of these monitored things fails: if a disk is dead, one needs to replace it (with not-so-old servers you can do this online and without service interruption). The procedure may differ from computer to computer, but the basic action is always the same: when the monitoring shows that a disk failed, look up the type of the failed disk, get a new one, go to the computer and replace the disk according to the guidelines of the manufacturer. That’s it. You can handle some thousand servers that way with only a few people.

Running applications isn’t standardized that way. One application requires a Windows server, others run, for historical reasons, only on big iron. One vendor offers performance guarantees only for Linux systems, other vendors don’t care about the platform as long as they have a WebSphere Application Server as base. Some applications are designed to run centralized, other applications can be clustered. Some have good logging and messages you can use for diagnosis, others don’t have that, and error causes must be detected with system tools like truss or strace.
So applications are highly non-standardized and often need special skills and knowledge to operate them. Automation is a very hard job here, and there must be support from management to get every part of the organisation moving in the right direction.

(As a side note: in my former life before I joined Day, I worked in a large IT operations organisation where every application was somehow non-standard; some less, some completely out of the ordinary. IT tried its best to create some kind of standardization, but the businesses often didn’t care much about it; also, developers didn’t know much about IT operations, so “but it works on my machine!!” and “just open the firewall so these 2 components can talk to each other” were often heard in early project stages.)

These applications also need completely different kinds of monitoring. The implementation of SAP monitoring looks different from the application monitoring of a web application. The actions to take in case of problems probably differ even more; and when it comes to investigating errors, the web application administrator cannot do anything on the SAP system. And vice versa.

So it’s advisable to separate the monitoring into 2 parts: The basic system monitoring and the application monitoring.

The system monitoring part can be done by one team for all servers. The application monitoring is too complex and too different, and the required actions so often need special know-how, that it must be adjustable to the needs of each application and its administrators.

As a final conclusion: every time a computer system is set up, put it into the basic system monitoring, so failing disks get replaced.
And when the application administrator deploys the application on it, the application-specific monitoring is installed as well.
Just because the needs and skills it takes to react to monitored issues are very different.

User administration on multi-client-installations

Developing an application for a multi-client installation isn’t only a technical or engineering quest; it also raises some questions which affect administration and organisational processes.

To ease administration, the user accounts in CQ are often organized in a hierarchy, so that users placed higher in the hierarchy can administer the users in the hierarchy tree below them. Using this mechanism an administrator can easily delegate the administration of certain users to other users, which can then do administrative work for “their” users.

The problem arises when a user needs rights in 2 applications within the same CQ instance and every application is supposed to have its own “application administrator” (a child node of the superuser). Then this kind of administration is no longer possible, because it is impossible to model a hierarchy in which application administrator A has neither a parent nor a child relation to application administrator B, while both A and B are placed higher in the hierarchy than a shared user C.

I assume that creating separate accounts for different applications but the same person isn’t feasible. That would be the easiest solution from an engineering point of view, but it contradicts the ongoing move away from creating a new user/password pair for each application and each user (single sign-on).

This problem imposes the burden of user administration (e.g. assigning users to groups, resetting passwords) on the superuser, because the superuser is the only user which is always (either transitively or directly) a parent of any user. (A non-CQ-based solution would be to handle user-related changes like password set/reset and group assignment outside of CQ and then synchronize these data into CQ, e.g. by using an LDAP-based directory system.)

ACLs and access to templates and workflows should be assigned only via groups and roles, because these can be created per application. If an application is currently based on a user hierarchy and individual user rights, it is hard to add a new application using the same users.

So one must make sure that all assignments are based only on groups and roles which are created per application. Assigning individual rights to a single user isn’t the way to go.

Is CRX 1.4.2 production ready?

The ContentBus technology was the standard storage backend up to the CQ 4.x series. Although file-based storage wasn’t state of the art even in the late 1990s (MySQL had already been invented, PostgreSQL existed, plus at least half a dozen enterprise database systems), Day chose to store the content objects in individual files, hidden by an abstraction layer. Of course it took some time of tuning and gaining experience, but the ContentBus proved to be a reliable storage which had the big advantage that with an editor on the filesystem you could solve nearly all problems (more than once we used sed to fix our default.map).

But some points were still open:

  • Online backup isn’t possible. The documentation simply states: “Shut down the CQ, copy your files, and start up again.” You can speed up the copy if you replace it with a snapshot on the filesystem layer, but the need to restart doesn’t make it enterprise-ready. (Databases have offered online backup for at least a decade.)
  • ContentBus requires a lot of filesystem I/O (mainly the system calls open and close). Having a lot of these operations slows down processing. A small number of larger files would reduce this administrative overhead in the filesystem.
  • Memory usage of ContentBus artifacts: some artifacts like default.map and zombie.map have in-memory data structures which grow as the underlying files grow. The more content you have, the more memory is used, even if only a small part of this content is in active use. This doesn’t scale well.
  • The ContentBus offers cluster support, but only with 2 nodes; with more nodes the overall performance will even degrade! According to the cluster documentation for CRX 1.4, Day tested CRX in a clustered setup with 6 nodes. If the performance loss is acceptable (that means, 6 nodes offer more performance than 5 nodes), this would be a really good way to scale your authoring systems.

So we decided that it’s time to evaluate whether CRX would be at least as good as the ContentBus. The TAR persistence manager mainly addresses the backup issue; we hope to get some performance improvements as well.

So currently I’m doing a test setup of CQ 4.2.0 and CRX 1.4.2, for which Day offered (just in time :-)) a knowledge base article.

Take care of your selectors!

Recently I have shown two scenarios where selectors can be used as a way to cache several different views of a single page. This quite often allows one to avoid HTTP parameters, reducing the load on your machines and speeding up your website.

Let’s assume that you have the already mentioned handle /etc/medialibrary/trafficjam.html and your templates support displaying the image in 3 different sizes: “preview”, “big” and “original”. So what happens if somebody chooses to request the URL “/etc/medialibrary/trafficjam.tiny.html”?

I checked some CQ-based websites and tested this behaviour by just adding a dummy selector. In most cases you get a proper page rendered, looking the same as without the selector. So most templates (and also template developers) ignore selectors that the specific template isn’t expected to handle. That’s good, isn’t it?

Well, in combination with the dispatcher cache it isn’t good. The dispatcher caches everything which is returned with an HTTP status code 200 from CQ. So just adding a “foo” selector will place another copy of the page in the dispatcher cache. The same happens with a “foo1” selector, and so on. In the end the disk is full, and the dispatcher cannot write any more files to disk but will forward every request to your CQ.

So, how can you avoid this problem? As said, the dispatcher only caches responses with an HTTP status code 200. So you need to add some code to your templates which always checks the selectors. If the specific template doesn’t support any selector, fine: if called with a selector, don’t return status code 200, but a 301 (permanent redirect) to the same page without any selectors, or just a plain 404 (“file not found”); because calling this page with selectors isn’t a valid action and should never happen, such a status code is okay. The same applies when the template supports a limited set of selectors (“preview”, “big” and “original” in the example above): treat them as a whitelist, and if the given selector doesn’t match, return the 301 or 404.
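In a Sling-based setup such a check could look like the following sketch (class name and whitelist are made up for illustration; this is not production code):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;

public class MediaRenderServlet extends SlingSafeMethodsServlet {

    private static final Set<String> ALLOWED_SELECTORS =
            new HashSet<String>(Arrays.asList("preview", "big", "original"));

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws IOException {
        // Reject any selector which is not whitelisted, so the dispatcher
        // never caches a copy of the page under a bogus selector.
        for (String selector : request.getRequestPathInfo().getSelectors()) {
            if (!ALLOWED_SELECTORS.contains(selector)) {
                response.sendError(SlingHttpServletResponse.SC_NOT_FOUND);
                return;
            }
        }
        // ... render the requested view ...
    }
}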

So you don’t pollute your cache and still have the flexibility to use selectors. I think that this will outweigh the cost of adjusting your templates.

The output-cache

Day Communique offers another level of caching for content: the so-called output-cache. It caches already rendered pages and stores them in the CQ instance itself. The big advantage over the already mentioned dispatcher-cache is that the output-cache knows about the dependencies between handles, because it is part of CQ and can access this information.

You can combine the dispatcher-cache and the output-cache. If no invalidation happens, the dispatcher-cache can answer most requests for static content and doesn’t bother CQ with them. An invalidation may invalidate a larger part of the dispatcher-cache, but many of the requests which are then forwarded to CQ can be answered by the output-cache. So it sounds quite good to use both dispatcher-cache and output-cache to cache all content. In some cases it is.

But the output-cache has some drawbacks:

  • All dependencies are kept in memory (in the Java heap). If you have a huge amount of content, the dependency tracking can eat up your memory. There is a hotfix available for CQ 4.2 which limits the cache size, but it has some problems.
  • The rendered pages are kept on disk. So besides the content itself you need disk space for the rendered output.
  • Probably the most important point: the output-cache makes CQ use only one CPU! Obviously there is a mutex within the output-cache code which serializes all rendering requests whose results should be cached. I cannot say that for sure, but this is what I observed in CQ instances using the output-cache. After we disabled the output-cache (which means adding a “deny all” rule to its configuration), this phenomenon was gone and CQ used all available CPUs.

Of course you don’t need to configure the output-cache to cache all rendered pages. You may want to cache only the pages which are often requested and often invalidated in the dispatcher-cache although they haven’t changed at all.

Dispatcher cache content delivery and invalidation

In contrast to a CQ instance, which can handle and resolve specific dependencies between handles, the CQ dispatcher cache works quite simply. It uses so-called “stat-files” to indicate whether cached files are still valid or need to be refetched from the configured CQ.

The basic principle is that the last modification time of a stat-file is compared to the last modification time of a content file. If the stat-file has been modified after the content file, the content file is stale and needs to be refetched; refetching also updates the file’s last-modification date to the current date. Otherwise the cached file can be delivered as up-to-date content.

In detail

The cache mirrors the content structure of the content in CQ. Using the statement “/statfileslevel” in the dispatcher.any file you can specify the depth down to which these stat-files are created. If the dispatcher receives an invalidation request, it calculates the directory containing the files which should be invalidated. If that directory doesn’t contain a stat-file, the dispatcher goes up in the directory structure until a stat-file is found. Then this stat-file is touched, setting its last modification time to now. (If there are also stat-files in directories below the one of the invalidated file, they are touched as well.)
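For illustration, this is how the statement could look in the /cache section of a dispatcher.any (the docroot path is just an example):

/cache
  {
  /docroot "/var/www/html"
  # create .stat files down to 2 directory levels below the docroot
  /statfileslevel "2"
  }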

If a requested file is contained in the cache, the dispatcher looks for a stat-file. If there is no stat-file in the directory of the to-be-delivered file, the dispatcher checks the parent directories until it finds one. Then the last-modification times are compared and acted upon accordingly (as described above).
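A minimal sketch of this principle in Java (not the actual dispatcher code, which is a C webserver module):

import java.io.File;

class StatFileCheck {
    // Walk up from the cached file's directory until a .stat file is found,
    // then compare modification times: the cached copy is valid only if it
    // is newer than the nearest stat-file.
    static boolean isCacheValid(File cachedFile) {
        for (File dir = cachedFile.getParentFile(); dir != null; dir = dir.getParentFile()) {
            File stat = new File(dir, ".stat");
            if (stat.exists()) {
                return cachedFile.lastModified() > stat.lastModified();
            }
        }
        return true; // no stat-file found at all: treat the cached copy as valid
    }
}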

For correctness: there isn’t any recursion in the code anymore, since with deeply nested content structures it produced stack overflows, killing the delivering webserver process. If you encounter this problem, update to the latest available dispatcher, currently version 4.0.2.

Update (2009-01-26): I just stumbled over the presentation Dominique Jaeggi gave at Tech Summit 2007. He talked about performance tuning, covers the proper use of the dispatcher in depth and explains the proper configuration of the invalidation.

Why use the dispatcher?

The dispatcher is a major part of every CQ installation. It allows one to cache data in front of the application, in the webserver. Unlike an application server, a webserver is designed to deliver files at high speed; webservers are blazingly fast when delivering static files. An application server (either the servlet engine provided by Day or any other one like Tomcat, JBoss, WebSphere or BEA WebLogic) is designed to be extensible and to run custom code. Application servers provide an infrastructure to run custom code which operates on request data and delivers a customized response. They do a lot more, but for the sake of simplicity I stop here.
Application servers don’t achieve the performance of a webserver when delivering files; it’s not their primary job, and they aren’t optimized to deliver files from disk to the net as fast as possible.

A content management system often produces static content elements which change rarely or not at all. Such content elements (be they HTML pages, images or other media elements) should be placed on a webserver to benefit from its ability to deliver files really fast. But it should also be possible to control such files from the content management system.

This is where the CQ dispatcher comes into play. The dispatcher acts as a plugin to the webserver and gives the content management system administrative control over the files which are placed on the webserver. The dispatcher fetches files on demand from the CQ instance and invalidates them on a specific signal.
So you get the best of both worlds: the flexibility of a content management system with control over all parts, and the performance of static pages delivered by a webserver.