JCR queries with large result sets

TL;DR: If you expect large result sets, try to run that query asynchronously and not in a request; and definitely pay attention to the memory footprint.

JCR queries can be a tricky thing in AEM, especially when it comes to their performance. Over the years practices have emerged, with the most important of them being “always use an index”. You can find a comprehensive list of recommendations in the JCR Query cheat sheet for AEM.

There you can also find the recommendation to limit the size of the result set (it’s the last in the list); while that can definitely help if you need just 1 or a handful of results, this recommendation is void if you need to compute all results of a query. And that situation can get even worse if you know that this result set can be large (like thousands or even tens of thousands of results).

I have seen that often, when content maintenance processes were executed in the context of requests, which took many minutes in an on-prem setup, but then failed on AEM CS because of the hard limit of 60 seconds for requests.

Large result sets come with their own complexities:

Iterating through the entire result set requires ACL checking plus the proper conversion into JCR objects. That’s not for free.
As the query engine puts a (configurable) read limit to a query, it can have a result set of at maximum 100k nodes by default. This number is the best case, because any access to the repository to post-filter the result delivered by the Lucene index also counts towards that number. If you cross that limit, reading the result set will terminate with an exception.
The memory consumption: While the JCR queries provide an iterator to read the result set, the QueryBuilder API provides API which read the entire result set and return it as a list (SearchResult.getHit()). If this API is used, just the result set can consume a significant amount of heap.
And finally: what does the application do with the result set? Is it performing an operating with each result individually and then does not the single result anymore? Or does it read each result from the query, performs some calculations and stores them again in a list/array for the next step of processing. Assuming that you have 100k querybuilder Hits and 100k custom objects (potentially even referencing the Hit objects), that can easily lead to a memory consumption in the gigabytes.
And all that could happen in parallel.

In my experience all of these properties of large result sets mandate that you run such a query asynchronously, as it’s quite possible that this query takes tens of seconds (even minutes) to complete. Either run it as a Sling Job or using a custom Executor in the context of an OSGI service, but do not run them in the context of request, as in AEM CS this request has the big chance to time out.

The Explain Query tool

When there’s a topic which has been challenging forever in the AEM world, then it’s JCR queries and indexes. It can feel like an arcane science, where it’s quite easy to mess up and end up with a slow query. I learned it also the hard way, and a printout of the JCR query cheatsheet is always below my keyboard.

But there were some recent changes, which made the work with query performance easier. First, in AEM CS the Explain Query tool has been added, which is also available via the AEM Developer Console. It displays queries, slow queries, number of rows read, the used index, execution plan etc. But even with that tool alone it’s still hard to understand what makes a query performant or slow.

Last week there was a larger update to the AEM documentation (thanks a lot, Tom!), which added a detailed explanation of the Explain Query tool. Especially it drills down into the details of the query execution plan and how to interpret it.

With this information and the good examples given there you should be able to analyze the query plan of your queries and optimize the indexes and queries before you execute them the first time in production.

What’s the maximum size of a node in JCR/AEM?

An interesting question which comes up every now and then is: “Is there a limit how large a JCR node can get?”. And as always in IT, the answer is not that simple.

In this post I will answer that question and also outline why this limit is hardly a constraint in AEM development. Also I will show ways how you can design your application so that this limit is not a problem at all.

Continue reading →

Long running sessions and SegmentNotFoundExceptions

If you search this blog, you find one recurring theme over the years: The lifecycle of JCR sessions and Sling ResourceResolvers. That you should not keep them open for a long time. And that you definitely have to close them. But I never gave you an example what can happen if you don’t follow this recommendation. Until now.

These days I learned that was is actual problem which can arise because of it. And the problem is called “SegmentNotFoundException”.

In the past a SegmentNotFoundException was a clear indication of a corrupt JCR repository. The recommendation was always either to fix it or to restore from backup. Both operations are tedious, require downtimes and possibly also mean a loss of data. That’s probably also the reason why this specific problem is often taken for the sign of such a repository exception. So let’s systematically look at it.

The root cause

With AEM 6.4 the feature of “tail-compaction” was introduced, which is a version of the online compaction feature. It is less efficient but takes less time than the full compaction. By default in AEM the tail compaction runs daily and the online compaction once a week.

But from what I understood, this tail compaction has a problem with long running sessions, and it can happen, that tar files are compacted and removed, which are still referenced. That means, that it’s not really a on-disk corruption which needs to be fixed, but rather that some “old sessions” (read about MVCC in the previous post) are referencing data which is not there (anymore).

An unclosed session – a symbol photo (by engin akyurt on Unsplash)

Validate the symptoms

This problem I describe in this post happens under some special circumstances, which you should check first before you start the hunt for long-running sessions:

You get SegmentNotFoundExceptions (always with the same segment ID).
A repository check doesn’t find any inconsistency.
If you restart the instance, the error is gone, but appears again after some time (mostly at least a day).
You are running AEM 6.4 or AEM 6.5 (SP doesn’t seem to matter).

In the case I observed, only a single workflow step was affected, but not all the time and only after some time, which made me believe that it was related to the compaction. But it was very hard to track down the error, because the workflow step itself was complex, but safe.

The solution

Fix any long-running session in your application (unless you are registering an ObservationListener in there, which takes care of the refreshs by design). Really all. Use the JMX webconsole plugin and check the list of registered session mbeans every day on a production instance. Count them. Look at the timestamps when the session was opened.

In the case I observed, the long running session was in a different area of the application, but was working on the same data (user profiles) as the failing workflow step. But the 2 areas in the code were totally unrelated to each other, so that was the only way to track it down.

Final words

Some other notes, which I consider as important in this context:

When you encounter a SegmentNotFoundException, please always open a support ticket, just in case. If it’s a different issue than described here, it’s better if you have that ticket open already.
If you see exactly this issue, and changing your application code makes this problem go away, please also raise a support ticket. That bug should get fixed (even if long-running sessions are not recommended since years).
As mentioned, when you encounter this issue, it’s not a persisted corruption. Restarting will cause the issue not to appear for some time, but that should only buy you time to identify and fix the long running sessions.
And AEM as a Cloud Service is not affected by this problem, because neither Online Compaction nor Tail Compaction are used. Instead the Golden Master is offline compacted before cloning.

Prevent workflow launchers from starting a workflow

Workflow launchers are the standard way to trigger workflows based on changes in the content respository. The most prominent workflow which is triggered that way is the “Asset Update Workflow”, which does all the heavy lifting regarding asset processing. And it’s important to note that this workflow is executed on all changes to an asset itself, its renditions or on metadata.

But often this is not required. If you add more or custom meta data to an asset or even do it in a batch mode, you don’t want to this workflow to run at all; these metatadate changes are not relevant to assets themselves, but just to the way they should be handled in the specific context of your application.

The typical way to make the workflow not to start is to disable the workflow launcher (setting the “enabled” flag to “false”). But this is a global setting which affects all possible invocations, that means also the regular ingestion; and in that case the workflow has to run. So you need a way to specifically disable the workflow to start.

Fortunately there are a few ways how to achieve that, if you have the code under control, which performs the changes, and after which you don’t want the workflow to start again. This is key, because there is a feature available in the workflow launcher (sidenote: I just found that it has been documented; so it often makes sense to check documentation if there have been updates).

You can configure on the workflow launcher an exclusion property in the format “event-user-data:randomString”; this ignores all changes made by a JCR session which has a user-property “randomString” set.

How can you set that property? That’s quite easy:

Session session = ...; session.getWorkspace().getObservationManager().setUserData("randomString"); // do you work with the session session.save();

And by default the “Asset Update Workflow” is configured with “event-user-data:changedByWorkflowProcess”, so if your batch asset-operation sets the user-data to this string “changedByWorkflowProcess”, the “Asset Update Workflow” is not triggered anymore, without disabling the workflow launcher for it.

That’s it. And if you ever wanted to channel data from a saving session to the process which handles the observation events for it (the workflow launchers are just a very convenient way around the JCR Observation API): Just use event.getUserData().

AEM transaction size or “do a save every 1000 nodes”

An old rule of thumb, even on earlier versions of CQ5, is “when you do large repository operations, do a session.save() every 1000 nodes”. The justification for this typically, that this is the default of the Package Manager, and therefor it’s a kind of recommended approach. And to be honest, I don’t know the real reason for it, even though I work in the Day/Adobe ecosystem for quite some time.

But nevertheless, with Oak the situation has changed a bit. Limits are much more explicit, and this rule of “every 1000 nodes do a save” can be considered still as true statement. But let me give you some background on it, why this exists at all. And then let’s find out, if this rule is still safe to use.

In the JCR specification there is the concept of transient space. This transient space holds all activities performed on a session until an implicit or explicit save() of the session. So the transient space holds all temporary data of a transaction, and the save() is comparable to the final commit of a transaction.

This transient space is typically hold inside the java heap, so dealing with it is fast.

But by definition this transaction is not bound in terms of size. So technically you should be able to create sessions, which modify all nodes and every property of a repository of 2 TB size. This transient space does not fit into heap of a standard size (say: 12GB) any more. In order to support this behavior nevertheless, Oak starts to move this transient space entirely into the storage (TarMK, Mongo) if the transient space is getting too large (in the DocumentNodeStore language this is called a “persistent branch”, see the documentation of the DocumentNodeStore for some details on branches); then the size of the transaction is only limited by the amount of free storage on the persistance, but no longer by the size of the Java heap.

The limit is called update.limit and by default this 10k (up to and including Oak 1.4/AEM 6.2, 100k starting with Oak 1.6/AEM 6.3, see OAK-3036. But of course you can change this value using “-Doak.update.limit=40000”.

This value describes the amount of changes a transient space for a single session can hold before it is moved into the persistence. A change is a any change to a property or a node (adding/removing/modifying/reordering/…).

OK, that’s the theory, but what does this mean for you?

First, if the transient space is swapped to the persistence, the final session.save() will take much longer compared to a transient space in memory. Because to do the save, the transient space needs to be read from the persistence first (which typically includes at least disk I/O, in cases of MongoDB network I/O, which is even slower).

And second, when you add or change nodes, you typical deal with properties as well. So if you are on AEM 6.2 or older, you should check that you don’t do too much changes within a session, so you don’t hit this “10’000 changes” limit and get the performance penalty. If you have a reasonable content structure, the above mentioned rule of thumb of “do a save every 1000 nodes” goes into the very right direction.

But that’s often not good enough, because the updates of the synchronous Oak indexes count towards the 10’000 changes as well. You might know, that the synchronous indexes mirror the JCR tree, thus adding 1000 JCR nodes will also add 1000 oak nodes for the nodetype index. And that’s not the only synchronous index…

Thus increasing the update.limit to a higher number makes pretty much sense just to be on the safe side. But there is a drawback when you have such large limits: It’s the size of the transient space. Imagine you upload 1000 assets (1 MB each) into your repository in a single session, and you have the update.limit set to 100’000. The number of changes will not reach the update.limit, that’s unlikey. But your transient space will consume 1 GB of heap at least! Is your system designed and setup to handle this? Do you have enough free JVM heap?

Let’s conclude: The rule of thumb “do a save every 1000 nodes” might be a bit too optimistic on AEM 6.2 and older (with default values), but ok on AEM 6.3. But always keep the amount of transient space in mind. It can overflow your heap and debugging out-of-memory situations is not nice.

If you are interested in the inner working of Oak, look at this great piece of documentation. It covers a lot of lowlevel concepts, which are useful to know when you deal with the repository more often.

JCR Observation in clustered AEM instances

Clustering AEM got a bit different with the introduction of OAK. But with the enforcement of the MVCC model in Oak I also advise to revisit some patterns you might got used to. Because some code which worked with no apparent problem in AEM 5.x might cause problems now.

One thing I would check are the JCR Observation Listeners. Using JCR observation is a common way to react on changes in the repository and this is common pattern since CQ 5.0. So what’s the problem with that? The problem is that many JCR observation handlers are not written with clustering in mind.

Take the example that you need to react on changes in the repository and in turn modify something else. The usual approach is to have a service like this (omitting a lot of the boilerplate …)

public class MyListener implements EventListener {

 @Activate
 protected void activate() {
  ...
  ObservationManager om = session.getWorkspace().getObservationManager();
  om.addEventListener (this, 
   Event.NODE_ADDED,
   "/content/mysite",
   null,
   new String[]{"cq:Page"},
   true,
   true);
  ...
 }

 public onEvent (EventIterator events) {
  // iterate through the events and change something in the repository.
 }

}

This works very well in any non-clustered environment, because there is only a single event handler performing these changes. In clustered environments the situation is different, because now on each cluster node there is such a event handler active. And each one wants to perform the repository changes.
In that case you’ll see a lot of Oak exceptions (on all cluster nodes) which indicate that nodes have been modified externally (outside of the current session) and that a merge was not possible. This is because the changes happen in (quasi-) parallel, but not visible to the currently open sessions, thus causing these exceptions.

The only solution to this problem is to execute the EventListener only on a single node or to handle every event by exactly one event handler and not on all.

Handling every observation event on exactly handler is the elegant and scalable solution. The idea is to handle on every cluster node only the changes which happen on this cluster nodes („local events“). While the JCR API doesn’t have any notion of cluster and the Observation API does not give any information if a event is local or not, the Jackrabbit implementation (which Oak is using here) supports this through the JackrabbitObservationManager. As you can see in the following snippet, only the registration of the ObservationHandler changes, but not the handler itself.

public class MyScalableListener implements EventListener {

 @Activate
 protected void activate() {
  ...
  JackrabbitEventFilter ef = new JackrabbitEventFilter()
   .setAbsPath("/content/mysite")
   .setNodeTypes(new String[{"cq:Page"})
   .setEventTypes(Event.NODE_ADDED)
   .setIsDeep(true)
   .setNoExternal(true);
  JackrabbitObservationManager om = (JackrabbitObservationManager) session.getWorkspace().getObservationManager();
  om.addEventListener (this, ef);
  ...
 }

 public onEvent (EventIterator events) {
  // iterate through the events and change something in the repository.
 }
}

Through the Jackrabbit API extension you can register you EventListener to only handle local changes only and ignore any external ones, which are generated on another cluster nodes (using the setNoExternal(true) call). This is a scalable solution because the events handled at the location where they are generated, and no cluster nodes gets a bottleneck because of this.

So whenever you write an ObservationHandler and especially when you use a cluster, you should review your code and make sure, that you avoid concurrent access to the same resource. Of course there are many ways to have concurrent access even without clustering, but when you actually use clustering, the JCR observation handlers are the easiest piece of code to check and fix.