Detecting JCR session leaks

A problem I encounter every now and then are leaking JCR sessions; that means that JCR sessions are opened, but never closed, but just abandoned. Like Files, JCR sessions need to be closed, otherwise their memory is not freed and they cannot be garbage collected by the JVM. Depending on the number of sessions you leave in that state this can lead to serious memory problems, ultimately leading to a crash of the JVM because of an OutOfMemory situation.

(And just to be on the safe side: In AEM ootb all ResourceResolvers use a JCR session internally; that means whatever I just said about JCR sessions applies the same way to Sling ResourceResolvers.)

I dealt with this topic already a few times (and always recommended to close the JCR sessions), but today I want to focus how you can easily find out if you are affected by this problem.

We use the fact that for every open session an mbean is registered. Whenever you see such a statement in your log:

14.08.2018 00:00:05.107 *INFO* [oak-repository-executor-1] com.adobe.granite.repository Service [80622, [org.apache.jackrabbit.oak.api.jmx.SessionMBean]] ServiceEvent REGISTERED

That’s says that an mbean service is registered for a JCR session; thus a JCR session has been opened. And of course there’s a corresponding message for unregistering:

14.08.2018 12:02:54.379 *INFO* [Apache Sling Resource Resolver Finalizer Thread] com.adobe.granite.repository Service [239851, [org.apache.jackrabbit.oak.api.jmx.SessionMBean]] ServiceEvent UNREGISTERING

So it’s very easy to find out if you don’t have a memory leak because of leaking JCR sessions: The number of log statements for registration of these mbeans must match the number of log statements for unregistration.

In many cases you probably don’t have exact matches. But that’s not a big problem if you consider:

  • On AEM startup a lot of sessions are opened and JCR observation listeners are registered to them. That means that a logfile with AEM starts and stops (and the number of starts do not match the number of stops) it’s very likely that these numbers do not match. Not a problem.
  • The registration (and also the unregistration) of these mbeans often happens in batches; if this happen during logfile rotation, you might have an imbalance, too. Again, not per se a problem.

It’s getting a problem, if the number of sessions opened is always bigger than the number of sessions closed over the course of a few days.

$ grep 'org.apache.jackrabbit.oak.api.jmx.SessionMBean' error.log | grep "ServiceEvent REGISTERED" | wc -l
212123
$ grep 'org.apache.jackrabbit.oak.api.jmx.SessionMBean' error.log | grep "ServiceEvent UNREGISTERING" | wc -l
1610
$

Here I just have the log data of a single day, and it’s very obvious, that there is a problem, as around 220k sessions are opened but never closed. On a single day!

To estimate the effect of this, we need to consider that for every of these log statements these objects are retained:

  • A JCR session (plus objects it reaches, and depending on the activities happening in this session it might also include any pending change, which will never going to be persisted)
  • A Mbean (referencing this session)

So if we assume that 1kb of memory is associated with every leaking session (and that’s probably an very optimistic assumption), this would mean that the system above would loose around 220M of heap memory every day. This system probably requires a restart every few days.

How can we find out what is causing this memory leak? Here it helps, that Oak stores the stack trace when opening sesions as part of the session object. Since around Oak 1.4 it’s only done if the number of open sessions exceeds 1000; you can tune this value with the system property “oak.sessionStats.initStackTraceThreshold”; set it to the appropriate value. This is a great help to find out where the session is opened.

And then go to /system/console/jmx, check for the “SessionStatistics” mbeans (typically quite at the bottom of the list) and select on the most recent ones (they have the openening date already in the name)

session information in the mbean view
session information in the mbean view

And then you can find in the “initStackTrace” the trace where this session has been opened:

Stacktrace of an open JCR session
Stacktrace of an open JCR session

With the information at hand where the session has been opened it should be obvious for you to find the right spot where to close the session.
If you spot a place where a session is opened in AEM product code but never closed, please check that with Adobe support. But be aware, that during system startup sessions are opened and will stay open while the system is running. That’s not a problem at all, and please do not report them!

It’s only a problem if you have a at least a few hundreds session open with the very same stack trace, that’s a good indication of such a “leaking session” problem.

A good followup reading on AEM HelpX pages with some details how you can fix it.

OSGI: static and dynamic references

OSGI as component model is one of the cores of AEM, as it allows to dynamically register and consume services offered by other parts of the system. It’s the central registry you can ask for all kind of services.

If you have some weeks of CQ experience as developer, you probably already know the mechanics how to access a product service. For example one of the most often used statements in a service (or component) is:

@Reference
 SlingRepository repo;

which gives you access to the SlingRepository service, through which you can reach the JCR. Technically spoken, you build a static reference. So your service gets active only when this reference can be resolved. By this you can rely on the repository being available whenever your service is running. This is a constraint which is not a problem in many cases. Because it wouldn’t make sense for your service to run without the repository, and it also frees to permanently checking “repo” for being not null 🙂

Sometimes you don’t want to wait for a reference to be resolved (maybe breaking a dependency loop) or you can just deliver additional value if a certain (optional) service is available. In such cases you can make it a dynamic reference

@Reference(policy=ReferencePolicy.DYNAMIC)
SlingRepository repo;

Now there’s no hard dependency to the SlingRepository service; so your service might get active before the SlingRepository service is available, and therefor you need to handle the case that “repo” is null.

Per se this feature might have little importance to you, but combining it with other aspects makes it really powerful. More on that in the next post…

Ways to access your content with JCR (part 2): Performance aspects

In the previous post I described ways how you can access your data in JCR. I also showed, that the performance of these ways is different.

  • For the direct lookup of a node the complexity depends on the number of path elements, which need to be traversed from the root node to that node. Also the number of child nodes on each of these levels has an impact. But in general this lookup is pretty fast.
  • If you just iterate through child nodes (using node.getChildren()), it’s even faster, the lookup complexity is constant.
  • The JCR search as third approach no general estimation can be given, it depends too much on the query.

First, the JCR query consists of 2 parts: An index lookup and operations on the JCR bundles.

Note: Of course you can build queries, where an index lookup is not required and might be optimized by the query engine; for example “//jcr:root/content/geometrixx/*” would return all nodes below /content/geometrixx, but building such queries isn’t useful at all, and I consider them as a mis-use of JCR queries.

This combination is usually in such a way, that the index lookup produces a set of possible results, which are then filtered by the means of JCR, e.g. by applying path constraints or node type restrictions. In every case the ACLs taken into account.

Let’s consider this simple example:


/jcr:root/content/geometrixx/en//*[jcr:contains(., 'support')]

First, it looks up all properties for the search term “support”. As the backing system for JCR search is Apache Lucene, and Lucene is implemented as inverted index, direct lookups like this are extremely efficient.
Then for all results the path is calculated. This means, that for each result item the parent is lookup recursively until the root node. In that process the ACL checks are performed.

As soon as the query gets complicated and Lucene delivers many results (for example because you are looking for wildcards) or you do complex JCR-based operations in the query, this isn’t that easy and performant any more. The more nodes you need to load to execute a query (and for all path and ACLs evaluations you need to load the bundle from disk to your BundleCache) the more time it takes.

But if you traverse a subtree with node.getChildren() only these bundles are loaded to the BundleCache for evaluation.

So in many cases, especially when you need to search a small subtree for a specific node, it’s more efficient to manually traverse the tree and search for the node(s) than to use JCR search. This means, you use the other 2 approaches listed above. You might not be used to it when you worked with a relational database for years, but it is a very feasible way with possibilties of huge performance benefits.
So, give it a try. But don’t expect differences on your developer machine with a blazing fast SSD and 1 gigabyte repository size. Test it on your production-size repository!

CQ coding patterns: Sling vs JCR (part 3)

In the last 2 postings (part 1, part 2) of the “Sling vs JCR” shootout I only handled read cases, but as Ryan D Lunka pointed out in the comments of on part 1, writing to the repository is the harder part. I agree on that. While on reading you often can streamline the process of reading nodes and properties and ignore some bits and pieces (e.g. mixins, nodeTypes), on the write part that’s a very important information. Because on the write part you define the layout and types, so you need to have full control of it. And the JCR API is very good in giving you control over this, so there’s not much room for simplification for a Sling solution.

Indeed, in the past there was no “Resource.save()” method or something like this. Essentially the resourceResolver (as a somehow equivalent of the JCR session) was only designed for retrieving and reading information, but not for writing. There are some front ends like the Sling POST servlet, which access directly the JCR repository to persist changes. But these are only workarounds for the fact, that there is no real write support on a Sling Resource level.

In the last months there was some activity in that area. So there is a wiki page for a design discussion about supporting CRUD operations within Sling, and code has been already submitted. According to Carsten Ziegler, the submitter of this Sling issue, this feature should be available for the upcoming version of CQ. So let’s see how this will really look like.

But till then you have 2 options, to get your data into the CRX repository:

  • Use the SlingPostServlet: Especially if you want to create or change data in the repository from remote, this servlet is your guy. It is flexible enough to cater most of normal use cases. For example most (if not all) of the CQ dialogs and actions use this SlingPostServlet to persist their data in the repository.
  • If that’s not sufficient for you, or you need to write structures directly from within code deployed within CQ, use JCR.

So here Sling currently does not bring any benefit, if your only job is to save data from within OSGI to CQ. If many cases the Default POST Servlet might be suffcient when you need to update data in CRX from external. And let’s see what CQ 5.6 brings with it.

CQ coding patterns: Sling vs JCR (part 2)

In the last posting I showed the benefits of Sling regarding resource access over plain JCR. But not only in resource access both frameworks offer similar functionality, but also in the important area of listening to changes in the repository. So today I want to compare JCR observation to Sling eventing.

JCR observation is a part of the JCR specification and is a very easy way to listen for changes in the repository.

@component (immediate=true, metatype=false)
@service
class Listener implements ObservationListener {

  @Reference
  SlingRepository repo;

  Session session;
  Logger log = LoggerFactory.getLogger (Listener.class);

  @activate
  protected void activate () {
    try {
      Session adminSession = repo.loginAdministrative(null);
      session = adminSession.impersonate (new SimpleCredentials("author",new char[0]));
      adminSession.logout();
      adminSession = null;
      session.getObservationManager.addEventListener( this, // listener
        NODE_CREATED|NODE_DELETED|NODE_MOVED, // eventTypes
        "/", // absPath
        true, // isDeep
        null, // uuid
        null, //nodeTypeNames
        true // noLocal
      );
    } catch (RepositoryException e) {
      log.error ("Error while registering observation", e);
    }
  }

  @deactivate
  protected void deactivate() {
    session.getObservationManager.removeListener(this);
    session.logout();
    session = null:
  }

  private handleEvents (Events events) {
    while (events.hasNext()) {
      Event e = events.next();
      … // do here your event handling
     }
  }
}

In JCR the creation of an observation listener is straight forward, also the event listening. The observation process is tied to the lifetime of the session, which is started and stopped at activation/deactivation of this sling service. This kind of implementation is a common pattern.
Note: Be aware of the implications of using an admin session!

  • You can read and write everywhere in the repository
  • You work with elevated rights you probably never need.

So try to avoid an adminSession. In th example above is use a session owned by the user “author” for this via impersonation.

A sling implementation of the same could look like this:

@Component (immediate = true)
@Service()
@Property (name = "event.topics", value = "/org/apache/sling/api/resource/Resource/*")
class Listener implements EventHandler {

  public void handleEvent (Event event) {
    // handle
  }
}

You see, that we don’t have to care about registering a listener and managing a session. Instead we just subscribe to some events emitted by the Sling Eventing framework. This framework is essentially implementing the OSGI event specification, and therefor you can also subscribing the very same way to various other event topics. You can check the “Event” tab in the Felix console, where you can see a list of events, which just happened in your CQ.

Some example for topics:

One disadvantage of Sling eventing is, that you cannot restrict the resource events to a certain path or user. Instead Sling registers with an admin session and publishes all observation events starting at the root node of the repository (/). You should filter quite early in your handleEvent routine for the events you are really interested in.

But: With these events you don’t geld hold of a resource. You always have you create them by this pattern:

if (event.getProperty(SlingConstants.PROPERTY_PATH) != null) {
  Resource resource = resourceResolver.getResource(event.getPath()); 
}

Also there you need to take care, that you’re using an appropriate ResourceResolver (try to avoid an admin resource resolver as well).

Also JCR observation has some benefits, if you want or need to limit the events you receive more granularly, for example by listening to changes of specific nodeTypes only or mixins.

So, as a conclusion: If you need to restrict the amount of repository events you want to process directly on API level, JCR observation is the right thing for you. Otherwise go for Sling Eventing, as it offers you also the possibility to receive many more types of events. For non-repository events it’s the only way to listen for them.