Long running sessions and SegmentNotFoundExceptions

If you search this blog, you find one recurring theme over the years: The lifecycle of JCR sessions and Sling ResourceResolvers. That you should not keep them open for a long time. And that you definitely have to close them. But I never gave you an example what can happen if you don’t follow this recommendation. Until now.

These days I learned that was is actual problem which can arise because of it. And the problem is called “SegmentNotFoundException”.

In the past a SegmentNotFoundException was a clear indication of a corrupt JCR repository. The recommendation was always either to fix it or to restore from backup. Both operations are tedious, require downtimes and possibly also mean a loss of data. That’s probably also the reason why this specific problem is often taken for the sign of such a repository exception. So let’s systematically look at it.

The root cause

With AEM 6.4 the feature of “tail-compaction” was introduced, which is a version of the online compaction feature. It is less efficient but takes less time than the full compaction. By default in AEM the tail compaction runs daily and the online compaction once a week.

But from what I understood, this tail compaction has a problem with long running sessions, and it can happen, that tar files are compacted and removed, which are still referenced. That means, that it’s not really a on-disk corruption which needs to be fixed, but rather that some “old sessions” (read about MVCC in the previous post) are referencing data which is not there (anymore).

An unclosed session – a symbol photo (by engin akyurt on Unsplash)

Validate the symptoms

This problem I describe in this post happens under some special circumstances, which you should check first before you start the hunt for long-running sessions:

You get SegmentNotFoundExceptions (always with the same segment ID).
A repository check doesn’t find any inconsistency.
If you restart the instance, the error is gone, but appears again after some time (mostly at least a day).
You are running AEM 6.4 or AEM 6.5 (SP doesn’t seem to matter).

In the case I observed, only a single workflow step was affected, but not all the time and only after some time, which made me believe that it was related to the compaction. But it was very hard to track down the error, because the workflow step itself was complex, but safe.

The solution

Fix any long-running session in your application (unless you are registering an ObservationListener in there, which takes care of the refreshs by design). Really all. Use the JMX webconsole plugin and check the list of registered session mbeans every day on a production instance. Count them. Look at the timestamps when the session was opened.

In the case I observed, the long running session was in a different area of the application, but was working on the same data (user profiles) as the failing workflow step. But the 2 areas in the code were totally unrelated to each other, so that was the only way to track it down.

Final words

Some other notes, which I consider as important in this context:

When you encounter a SegmentNotFoundException, please always open a support ticket, just in case. If it’s a different issue than described here, it’s better if you have that ticket open already.
If you see exactly this issue, and changing your application code makes this problem go away, please also raise a support ticket. That bug should get fixed (even if long-running sessions are not recommended since years).
As mentioned, when you encounter this issue, it’s not a persisted corruption. Restarting will cause the issue not to appear for some time, but that should only buy you time to identify and fix the long running sessions.
And AEM as a Cloud Service is not affected by this problem, because neither Online Compaction nor Tail Compaction are used. Instead the Golden Master is offline compacted before cloning.

2 thoughts on “Long running sessions and SegmentNotFoundExceptions”

Sam says:

September 14, 2020 at 19:57

Thanks for this useful post.
Sorry I could not understand why java garbage collection framework does not release the memory automatically? I know an object will not be garbage collected if it has a reference.
If we take this code :
ResourceResolver resourceResolver = resourceResolverFactory.getServiceResourceResolver(null);
resourceResolver.getResource(path);
Here resourceResolver object is used to just to get the resource. And after the end of function execution there is no more reference present for resourceResolver object. So I do not understand why memory cannot be free by the java garbage collector? What reference it still holds? Is it because resourceResolver.getResource(path) holds the reference to segmentstore which cannot be released even after executing this line of code?
1. Jörg says:
  
  September 15, 2020 at 20:39
  
  HI Sam,
  
  The problem here is that the relation between the ResourceResolver (or more precisely: the JCR session encapsulated in the ResourceResolver) and the JCR repository is not uni-directional (that means only from the Session to the JCR), but bi-directional (so the JCR also knows all Sessions). So this object cannot get collected unless the reference from JCR to the Session is released, which happens with the ResourceResolver.close() in our case.

Comments are closed.