Slow deployments on AEM 6.4/6.5

A recent post on the AEM forums challenged me to look into an issue I observed myself but did not investigate further.

The observation is that during deployments maintenance tasks are stopped and started a lot, and this triggers a lot of other activities, including a lot of healtcheck executions. This slows down the deployment times and also pollutes the logfiles during deployments.

The problem is that the AEM Maintenance TaskScheduler is supposed to react on changes on some paths in the repository (where the configuration is stored), but unfortunately it also reacts on any change of ResourceProviders (and every Servlet is implemented as a single ResourceProvider). And because this causes a complete reload/restart of the maintenance tasks (and some healthchecks as well), it’s causing quite some delay.

But this behaviour is controlled via OSGI properties, which are missing by default, so we can add them on our own 🙂

Just create a OSGI configuration for com.adobe.granite.maintenance.impl.TaskScheduler and add a single multi-value property named “resource.change.types” with  the values “ADDED”, “CHANGED”, “REMOVED”.

(please also report this behavior via Daycare and refer to GRANITE-29609, so we hopefully get a fix for it, instead of applying this workaround).

Understanding the “Oak Repository Statistics” MBean

In the last releases AEM has been greatly enhanced to provide information which are suitable for health detection. Especially Oak provides a huge amout of MBeans which can be monitored. But sometimes they are a bit hard to understand. Based on some ongoing activities I digged through the “Oak Repository statistics” MBean and found it quite useful, even for some basic understanding and analysis.

I did this analysis and the screenshot on AEM 6.5, but this MBean is present at least since AEM 6.1 (probably even 6.0) and its content hasn’t changed much.

When you access this MBean the top of the page looks like this (this instance has just started):

Oak Repository Statistics MBean

There are a number of values collected, and presented for a number of times:

  • per second: The raw value in each second for the last minute.
  • per minute: The aggregated value on a minute basis for the last hour
  • per hour: The aggregated value on a hourly basis for the last day
  • per day: the aggregated value on a daily basis

The aggregation differs based on the type of the metric:

  • Gauge: This is a simple value, which is not further processed. When values of types must be aggregated, an average is calculated.
  • Counter: This a number, which can be accumulated. When values of type counter must be aggregated, they are summed up.
Attribute NameTypeDescription
SessionCountGaugeThe number of JCR sessions, which are currently open
SessionLoginGaugeThe number of sessions opened within that time
SessionReadCountCounterThe number of read operations in the JCR (over all sessions)
SessionReadDurationGaugeThe total time spent in read operations (nanoseconds)
SessionReadAverageGaugethe average duration for read operations (SessionReadDuration divided by the number of reads)
SessionWriteCountCounterThe number of write operations in the JCR (over all sessions). Be aware that session.refresh() is also counted as write operation!
SessionWriteDurationGaugeTotal time spent writing to sessions in nanoseconds
SessionWriteAverageGaugethe average duration for write operations (SessionWriteDuration divided by the number of writes)
QueryCountCounterThe number of JCR queries executed
QueryDurationGaugeThe total time spent in JCR query operations (miliseconds)
QueryAverageGaugethe average duration for queries (QueryDuration divided by the number of writes)
ObservationEventCountCounterThe number of observation events delivered to all listeners
ObservationEventDurationGaugeThe total time spent processing events by all observation listeners in nanoseconds
ObservationEventAverageGaugethe average duration spent processing observation events (ObservationEventDuration divided by the number of events)
ObservationQueueMaxLengthGaugeThe maximum length of the JCR Observation Queue; in newer Oak versions this queue does no longer exist, and then the value is -1

This measurement is done to limit the amount of data which needs to be stored. And this data is stored within the JVM inside the Oak bundles; that means that any restart of the JVM or restart of the Oak bundles will reset these values. If you want to persist these values you need to read them via JMX and store them.

Ok, what can you do with all this data? Well, it can help you to answer many questions. For example you can find out very easy, if you have a session leak.Because then numbers ion the SessionCount attribute always increase over time. It’s also interesting to find out what is happening within your system when it’s completely idle. Are there repository writes which you are not expecting? Queries every few seconds?

If you are investigating performance issues, or if you want to avoid them, you should have a look into this MBean.

Update: Fixed the only link in this post. Thanks Oswald for reporting!

How does Sling resolve an AEM Page to the correct resource type?

On the AEM forum there’s an interesting question:

Does this mean that all requests to the page node are internally redirected to _jcr_content for HTML and XML calls but not for other extensions ?

Without answering the question right here, it raises the question: How is a cq:Page node actually resolved? Or how does Sling know, that it should resolve a call to a cq:Page node to the resourcetype specified in its jcr:content node? Especially since the “cq:Page” does not have a “sling:resourceType” property at all?

A minimal page can look like this (JSON dump):

  "jcr:primaryType": "cq:Page",
  "jcr:createdBy": "admin",
  "jcr:created": "Mon Dec 03 2018 19:09:44 GMT+0100",
  "jcr:content": {
    "jcr:primaryType": "cq:PageContent",
    "jcr:createdBy": "admin",
    "cq:redirectTarget": "/content/we-retail/us/en",
    "jcr:created": "Mon Dec 03 2018 19:09:44 GMT+0100",
    "cq:lastModified": "Tue Feb 09 2016 00:05:48 GMT+0100",
    "sling:resourceType": "weretail/components/structure/page",
    "cq:allowedTemplates": ["/conf/we-retail/settings/wcm/templates/.*"]

A good tool to understand how a page is rendered is the “Recent Requests” tool available in the OSGI webconsole (/system/console/requests).

When I request /content/we-retail.html and check this request in the recent requests tool, I get this result (reduced to the relevant lines):

    1967 TIMER_END{65,ResourceResolution} URI=/content/we-retail.html resolves to Resource=JcrNodeResource, type=cq:Page, superType=null, path=/content/we-retail
1974 LOG Resource Path Info: SlingRequestPathInfo: path='/content/we-retail', selectorString='null', extension='html', suffix='null'
   1974 TIMER_START{ServletResolution}
   1976 TIMER_START{resolveServlet(/content/we-retail)}
   1990 TIMER_END{13,resolveServlet(/content/we-retail)} Using servlet /libs/cq/Page/Page.jsp
   1991 TIMER_END{17,ServletResolution} URI=/content/we-retail.html handled by Servlet=/libs/cq/Page/Page.jsp

The interesting part is here, that /content/we-retail resource has a type (= resource type) of “cq:Page”, and when we look down in the snippet, we see that it is resolved to /libs/cq/Page/Page.jsp; that means the resource type is actually “cq/Page”.

And yes, in case no resource type property is available, Sling’s first fallback strategy is to try the JCR nodetype as a resource type. And in case this fails as well, it goes back to the default servlets.

OK, now we have a point to investigate. The /libs/cq/Page/Page.jsp is very simple:

file="proxy.jsp" %>

And in the proxy.jsp there’s a call to the RequestDispatcher to include the jcr:content resource; and from there one the standard resolution starts which we all are used to.

And to get back to the forum question: If you look around in /libs/cq/Page, you can find more scripts to deal with extensions and also some selectors. And they all include only the call to the proxy.jsp. If your selector & extension combination do not work as expected, you might want to add an overlay there.

AEM transaction size or “do a save every 1000 nodes”

An old rule of thumb, even on earlier versions of CQ5, is “when you do large repository operations, do a every 1000 nodes”. The justification for this typically, that this is the default of the Package Manager, and therefor it’s a kind of recommended approach. And to be honest, I don’t know the real reason for it, even though I work in the Day/Adobe ecosystem for quite some time.

But nevertheless, with Oak the situation has changed a bit. Limits are much more explicit, and this rule of “every 1000 nodes do a save” can be considered still as true statement. But let me give you some background on it, why this exists at all. And then let’s find out, if this rule is still safe to use.

In the JCR specification there is the concept of transient space. This transient space holds all activities performed on a session until an implicit or explicit save() of the session. So the transient space holds all temporary data of a transaction, and the save() is comparable to the final commit of a transaction.

This transient space is typically hold inside the java heap, so dealing with it is fast.

But by definition this transaction is not bound in terms of size. So technically you should be able to create sessions, which modify all nodes and every property of a repository of 2 TB size.  This transient space does not fit into heap of a standard size (say: 12GB) any more. In order to support this behavior nevertheless, Oak starts to move this transient space entirely into the storage (TarMK, Mongo) if the transient space is getting too large (in the DocumentNodeStore language this is called a “persistent branch”, see the documentation of the DocumentNodeStore for some details on branches); then the size of the transaction is only limited by the amount of free storage on the persistance, but no longer by the size of the Java heap.

The limit is called update.limit and by default this 10k (up to and including Oak 1.4/AEM 6.2, 100k starting with Oak 1.6/AEM 6.3, see OAK-3036. But of course you can change this value using “-Doak.update.limit=40000”.

This value describes the amount of changes a transient space for a single session can hold before it is moved into the persistence. A change is a any change to a property or a node (adding/removing/modifying/reordering/…).

OK, that’s the theory, but what does this mean for you?

First, if the transient space is swapped to the persistence, the final will take much longer compared to a transient space in memory. Because to do the save, the transient space needs to be read from the persistence first (which typically includes at least disk I/O, in cases of MongoDB network I/O, which is even slower).

And second, when you add or change nodes, you typical deal with properties as well. So if you are on AEM 6.2 or older, you should check that you don’t do too much changes within a session, so you don’t hit this “10’000 changes” limit and get the performance penalty. If you have a reasonable content structure, the above mentioned rule of thumb of “do a save every 1000 nodes” goes into the very right direction.

But that’s often not good enough, because the updates of the synchronous Oak indexes count towards the 10’000 changes as well. You might know, that the synchronous indexes mirror the JCR tree, thus adding 1000 JCR nodes will also add 1000 oak nodes for the nodetype index. And that’s not the only synchronous index…

Thus increasing the update.limit to a higher number makes pretty much sense just to be on the safe side. But there is a drawback when you have such large limits: It’s the size of the transient space. Imagine you upload 1000 assets (1 MB each) into your repository in a single session, and you have the update.limit set to 100’000. The number of changes will not reach the update.limit, that’s unlikey. But your transient space will consume 1 GB of heap at least! Is your system designed and setup to handle this? Do you have enough free JVM heap?

Let’s conclude: The rule of thumb “do a save every 1000 nodes” might be a bit too optimistic on AEM 6.2 and older (with default values), but ok on AEM 6.3. But always keep the amount of transient space in mind. It can overflow your heap and debugging out-of-memory situations is not nice.

If you are interested in the inner working of Oak, look at this great piece of documentation. It covers a lot of lowlevel concepts, which are useful to know when you deal with the repository more often.

AEM coding best practice: No String operations on paths!

I recently needed to review project code in order to validate if it makes problems when upgrading from AEM 5.6 to AEM 6.x; so my focus wasn’t on the code in the first place, but on some other usual suspects (JCR queries etc). But having seen a few dozen classes I found a pattern, which I then found all over the code: the excessive use of String operations. With a special focus on string operations on repository paths.

For example something like this:

String[] segments = resource.getPath().split("/");
String settingsPath = "/" + StringUtils.join(segments,"/",0,2) + "/settings/jcr:content";
Resource settings = resourceResolver.get(settingsPath);
ValueMap vm = settings.adaptTo(ValueMap.class);
String language = vm.get("language");

(to read settings, which are stored in a dedicated page per site).

Typically it’s a weird mixture of String methods, the use of StringUtils classes plus some custom helpers, which do things with ResourceResolvers, Sessions and paths. Spread all over the codebase. Ah, and it lacks a lot of error checking (what if the settings page doesn’t contain the “language” property? adaptTo() can return “null”).

Sadly that problem not limited to this specific project code, I found it in many other projects as well.

Such a code structure is a clear sign for the lack of abstraction and guidance. There are no concepts available, which eliminate the need to operate on strings, but the developer is left with the only abstraction he has: The repository and CRXDE Lite’s view on it. He logs into the repository, looks at the structure and then decides how to mangle known pieces of information to get hold of the things he needs to access. If there’s noone which reviews the code and rejects such pieces, the pattern emerges and you can find it all over the codebase. Even if developers create utility classes for (normally every developer creates one on its own), it’s a bad approach, because these concepts are not designed (“just read the language from the settings page!”), but the design “just happens“; there is no focus on it, therefor quality is not enforced and error handling typically quite poor.

Another huge drawback of this approach: It’s very hard to change the content structure, because at many levels assumptions about the content structure are used, which are often not directly visible. For the example the constants “0” and “2” in the above code snippets determine the path prefix, but you cannot search for such definitions (even if they are defined as constant values).

If the proper abstraction would be provided, the above mentioned code could look like this:

String language = "en";
Settings settings = resource.adaptTo(Settings.class);
If (settings != null) {
  language = settings.getLanguage();

This approach is totally agnostic of paths and the fact that settings are stored inside a page. It hides this behind a well-defined abstraction. And if you ever need to change the content structure you know where you need to change your code: Only in the AdapterFactory.

So whenever you see code which uses String operations on paths: Rethink it. Twice. And then come up with some abstraction layer and refactor it. It will have a huge impact on the maintainability of your code base.

OSGI: static and dynamic references

OSGI as component model is one of the cores of AEM, as it allows to dynamically register and consume services offered by other parts of the system. It’s the central registry you can ask for all kind of services.

If you have some weeks of CQ experience as developer, you probably already know the mechanics how to access a product service. For example one of the most often used statements in a service (or component) is:

 SlingRepository repo;

which gives you access to the SlingRepository service, through which you can reach the JCR. Technically spoken, you build a static reference. So your service gets active only when this reference can be resolved. By this you can rely on the repository being available whenever your service is running. This is a constraint which is not a problem in many cases. Because it wouldn’t make sense for your service to run without the repository, and it also frees to permanently checking “repo” for being not null 🙂

Sometimes you don’t want to wait for a reference to be resolved (maybe breaking a dependency loop) or you can just deliver additional value if a certain (optional) service is available. In such cases you can make it a dynamic reference

SlingRepository repo;

Now there’s no hard dependency to the SlingRepository service; so your service might get active before the SlingRepository service is available, and therefor you need to handle the case that “repo” is null.

Per se this feature might have little importance to you, but combining it with other aspects makes it really powerful. More on that in the next post…

JCR Observation throttle

With JCR observation any JCR repository offers a really nice feature to react on any changes in the repository. Because the mechanism is quite powerful and easy to use, it has found many adopters which work with it.

If you have worked with that feature and also imported large amount of content, you might also encounter the problem that there is a delay between persisting a change in the repository and the time, when the observation event for this change is fired. This delay is mainly based on the number of registered observation handlers and the time these handlers require to process the event (the whole JCR observation events are handled by a single thread). If it takes 100ms to run through all handlers and you persist 10k changes in 2 minutes, it will pill up and take about 20 minutes until the queue is empty again.

This delay may harm your application and user experience, so it’s advisable to

  1. improve the processing speed of any JCR event handler
  2. keep the queue small.

Especially if you have jobs, which are not time critical but might cause a storm of events (e.g. data importers), you should be aware of this and add artificial pauses to your job, so the observation event queue will not get too large.

A simple tool to help has been added to CQ/AEM 5.6.1, it’s called JCR Observation throttle (see

It allows you to wait until all pending events in the observation queue have been delivered. This helps you to wait for quiet moments and then start your data import. But be careful, the delay might be very long, if other processes constantly interfere. 

Adding JMX-support

CQ5 (even in its latest incarnation CQ 5.4) has a rather poor support for monitoring. If you take a look at the system via the popular “jconsole” tool, you don’t get any useful mbeans, which can tell you anything about the system. Only some logging stuff.

If you decide to instrument your code and provide some information via JMX (that’s something I would recommend to everyone, who adds non-trivial services to CQ5), have a look at Apache Aries, especially at the JMX whiteboard. Deploy this bundle to your CQ5 and then just register your mbeans as services. Voila, that’s it. You don’t need to register and unregister your mbeans, as this is handled by the JMX whiteboard.

Sadly documentation is currently rather poor, but the sourcode isn’t that hard to understand. You can start with the initial patch in the Aries issue tracking.

Building custom CQ5 installation images

Very often one needs to setup a number of CQ5 installations with the same featuresets; e.g if you start with a bunch of new publishing instances or you need to update your development environments with a new set of hotfixes.

One way is to provide a detailled list of instructions plus the required files to the people responsible for it. It’s important to be consistent over all affected installations and environments, so you can remedy problems and issues because of missing fixes or wrong installation. But then a lot of manual work is included, which isn’t the thing IT people want to do.

I needed to provide several CQ5 installations in the last time. Because my standard installation recommendation consists of CQ 5.3 plus CRX 2.1 plus performancepack 30015 (using CQSE and the TarPM) at the moment, just deploying a CQ 5.3 the usual way isn’t sufficient. But on the other hand I don’t want to have the work of a manual installation of CRX 2.1 and the performancepack on top of a default CQ 5.3, both including restarts.

So I decided to build an image, which contains all these components, without the need for an restart, without fiddling around with the package manager, just by using some hidden features of CQ5.3 and CRX: the package installer and the flawless upgrade procedure of CRX 2.x (x={0,1}, will probably work also for later versions of CRX). You can find the documentation of the upgrade process also on the official documentation site.

1.) Unpack a plain CQ 5.3:

$ cd cq530
$ java -jar cq-wcm-quickstart-author-5.3.0.jar -unpack

2.) Get CRX 2.1 and unpack it:

$ cd crx21
$ java -jar crx- -unpack

3.) Copy the CRX webapplication file of CRX 2.1 into the unpacked CQ 5.3 installation:

$ cp crx21/crx-quickstart/server/webapps/crx-explorer_crx.war cq530/crx-quickstart/server/webapps

4.) Remove the CRXDE webapplication of CQ 5.3, as it is no longer needed for CRX 2.1

$ cd cq530/crx-quickstart/server/webapps
$ rm crx-de_crxde.war

5.) Edit also the server.xml, and remove the crxde webapp

6.) Define an order, in which the packages are deployed to CRX; as the packages are deployed in the order, they are listed by default in a shell, I define an order by explictly naming the files like “01_cq-content-5.3.jar”, “” and so. The files must be placed in the cq530/crx-quickstart/repository/install folder.

Make sure that the original “cq_content-5.3.jar” is deployed as first package, as it contains the WCM code. But then you can place there any CQ package you want: hotfixes, custom application code, initial content etc.

$ cd cq530/crx-quickstart/repository/install
$ cp cq-content-5.3.jar 01_cq-content-5.3.jar
$ cp
$ cp .........../

7.) If you want to use the CRXDE, you should download the file from Day PackageShare and copy it also into the install directory:

$ cp ......../ 05_cq530/crx/repository/install

8.) For convenience you can place now your file next in the toplevel directory of your installation, the result should be something like this:

$ ls -la
-rw-r--r--   1 jorghoh  staff  233110810  6 Aug 09:38 cq-wcm-quickstart-author-5.3.0.jar
drwxr-xr-x  10 jorghoh  staff        340  7 Okt 17:55 crx-quickstart
-rw-r--r--   1 jorghoh  staff        217  6 Aug 09:39

If you don’t want to deliver the license file with that image, you can omitt it; if the instance is started the first time, it is asked then.

9.) Now all parts are in place; so you can create an image file (tar file) and distribute it all over your environments:

$ cd cq530/..
$ tar -cf cq530-crx21-author-image.tar cq530/*

(if you rename the cq-wcm-quickstart-author-5.3.0.jar file to cq-wcm-quickstart-publish-5.3.0.jar, you have an image for a publish instance.)

Just unpack it and use your usual startup mechanisms (“start.bat” or “start”), and the framework will startup as usually, create a repository and also deploy all packages in the install folder directly to it. If you encounter problems, you may check with the Felix console, if all bundles are started.

Now you have an image, which you can copy and uncompress everywhere, even the plattform (Unix, Windows) doesn’t matter, as all is only dependent on the features of CRX Launchpad and CRX itself. A setup of a new emtpy instance, independent of the number of installed packages and hotfixes, can now be done within 2 minutes and can be fully automated.

Bootstrapping the CQ java process

In this article I will cover one special “feature” of the unix variant of CQ, which is the start script “serverctl”; its job is to perform the start and stop of the CQ java process. For Windows there is the server.bat version, which is kind of straight forward. The serverctl script is more complex and historically grown. I try to explain the basics of the CQ process handling, the handling of the java arguments and will show some ways to cope with the open-files problem. If you use CQ without the prepackaged CQSE, you don’t use this script, therefor this posting may not be interesting to you.

The first half of the script has actually no influence on the whole process handling, it is just preseeding parameters and parses the command line arguments. At about line 350 the interesting things start. In the default the CQ process is starting in the background; this start is triggered by the psmon process (actually the script starts a new incarnation of itself with the additional parameter “psmon”; in the process table it is visible with this “psmon” parameter, so I will just call it “psmon-process”).

This process performs the following important things:

  • Register a trap: when the TERM-signal is received it will remove the PID-files of its own process and the PID file of the Java process.
  • Start the terminator process (again a new instance of the process with the “terminator” parameter”) and attach its stdout filedescriptor to the stdin of the bgstart-process (same procedure here). This actually happens in a endless loop.

The terminator process is actually the another instance of the script. It also registers a trap (reacting on some signals). The main action in this trap-handler is that the string “QUIT” is written to stdout.

The bgstart process is another serverctl-instance, which creates the file with its own PID and then replaces itself completly by the final java process. Because the stdout filehandle of the terminator process is connected to the stdin-process of the bgstart process, it is inherited to the java process; so the terminator-process writes its “QUIT” to the stdin of the Java process!

Ok, the startup is quite clear now, so why the hell do we actually need 3 processes, 2 shell scripts and the java process? Well, let’s consider how the whole thing is shut down properly (and killing the Java process is not a good option).

The whole thing starts with the “stop” option (about line 570). First the TERM-signal is sent to the psmon-process, which then removes its own PID file and the file of the java process. Then the TERM-signal is sent to the terminator process, which writes the “QUIT” string to the java process. This string obviously tells the servlet engine to start a shutdown. Then the stop process waits up to 20 second and checks if the file vanishes. If after these 20 seconds the process is still there, it sends a TERM-signal to the java process (and assumes that this will do the rest and bring it down). But killing the java process isn’t really the friendly way, because the process will terminate immediately, leaving the shutdown process unfinished and the whole CQ/CRX in a inconsistent way. During next startup CRX will usually complain about a unclear shutdown; in most times you need to remove .lock files first before it really starts up properly.

Ok, finally the individual jobs of each process:

  • psmon: Restart in case of crashes
  • terminator: When issued, write “QUIT” to the java process
  • java process: actually do the work.

And for now a few tips:

Especially for large instances 20 seconds are often not enough to properly shutdown the whole CQ system. You want to increase this time when your system does not properly startup because either CRX performs some repair actions on startup; or it just refuses to startup because some .lock files are present. Then change the number in line 591 to something like this (I replaced the value 20 seconds by 300):

while [ -f "$CQ_LOGDIR/" ] && [ $COUNTER -lt 300 ]; do
  printf "."
  COUNTER=`expr $COUNTER + 1`
  sleep 1

This causes the script to wait up to 5 minutes, which should be enough for every CQ to shutdown. But if there are other problems, you have to wait 5 minutes until finally the kill happens.

When the java process was actually killed, it often leaves back some files in an inconsistent state. To stabilize the restart behaviour you may decide that you don’t want CQ to complain and stop during the regular startup; you just want CQ back in action asap and immediately start the recovery. You can add then the following lines to line 476 (before the “info” statement):

# remove all .lock files of a CQ crash/process kill
find $CQ_CONTEXT -name ".lock" | while read LOCKNAME; do
  warn "remove stale lock file $LOCKNAME"

Do this on your own risk, because you will never get to know if lockfiles have been left until you see CQ rebuilding its search and/or index files. But, a simple restart will your system bring back up (besides the case, when a crash killed things which cannot be recovered). With this proposed solution you have some data in the startup.log file, if (and which) stale lock files have been removed.

Another problem which often arises, is the number of open files. CQ5 and CRX as repository usually have many files open, a default installation has about 300 open files immediately after startup. If the number of requests increases and your repository grows, this number will grow too. At some point you will need to increase the maximum number of open files (in Unix speak: the value of ulimit -n).
By default this value is set to 1024 (CQ 5.2 and 5.2.1) by the serverctl script (until it is overriden by the value of CQ_MAX_OPEN_FILES in the wrapper script). Increasing this value by adjusting /etc/security/limits.conf (Redhat/Fedora) or via any other OS-preferred way does not help, the serverctl script always overrides this value. Applying this patch will fix this behaviour (to my readers from within Day: already reported in Bugzilla). A small patch for the “serverctl status” command will also print the configured value plus the current number of open files for the CQ process (also in Bugzilla).