Understanding AEM request processing using the OSGI “Recent Request” console

During some recent work on performance improvements in request processing I used a tool, which is part of AEM for a very long time now; I cannot recall a time when it was NOT there. It’s very simple, but nevertheless powerful and it can help you to understand the processing of requests in AEM much better.

I am talking about the “Recent Requests Console” in the OSGI webconsole, which is a gem in the “AEM performance tuning” toolbox.

In this blog post I use this tool to explain the details of the request rendering process of AEM. You can find the detailed description of this process in the pages linked from this page (Sling documentation).

Screenshot “Recent requests”

With this Recent Requests screen (goto /system/console/requests) you can drill down into the rendering process of the last 20 requests handled by this AEM instance; these are listed at the top of the screen. Be aware that if you have a lot of concurrent requests you might often miss the request you are looking for, so if you really rely on it, you should increase the number of requests which are retained. This can be done via the OSGI configuration of the Sling Main Servlet.

When you have opened a request, you will see a huge number of single log entries. Each log entry contains as first element a timestamp (in microseconds, 1000 microseconds = 1 millisecond) relative to the start of the request. With this information you can easily calculate how much time passed between 2 entries.

And each request has a typical structure, so let’s go through it using the AEM Start page (/aem/start.html). So just use a different browser window and request that page. Then check back on the “Recent requests console” and select the “start.html”.
In the following I will go through the lines, starting from the top.

      0 TIMER_START{Request Processing}
      1 COMMENT timer_end format is {<elapsed microseconds>,<timer name>} <optional message>
     13 LOG Method=GET, PathInfo=null
     17 TIMER_START{handleSecurity}
   2599 TIMER_END{2577,handleSecurity} authenticator org.apache.sling.auth.core.impl.SlingAuthenticator@5838b613 returns true

This is a standard header for each request. We can see here that the authentication took 2599 microseconds.

   2981 TIMER_START{ResourceResolution}
   4915 TIMER_END{1932,ResourceResolution} URI=/aem/start.html resolves to Resource=JcrNodeResource, type=granite/ui/components/shell/page, superType=null, path=/libs/granite/ui/content/shell/start
   4922 LOG Resource Path Info: SlingRequestPathInfo: path='granite/ui/components/shell/page', selectorString='null', extension='html', suffix='null'

Here we see the 2 log lines for a the resolving process of a resourcetype. It took 1932 microseconds to map the request “/aem/start.html” to the resourcetype “granite/core/components/login” with the path being /libs/granite/ui/content/shell/start. Additionally we see information about the selector, extension and suffix elements.

   4923 TIMER_START{ServletResolution}
   4925 TIMER_START{resolveServlet(/libs/granite/ui/content/shell/start)}
   4941 TIMER_END{14,resolveServlet(/libs/granite/ui/content/shell/start)} Using servlet BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)
   4945 TIMER_END{21,ServletResolution} URI=/aem/start.html handled by Servlet=BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)

That’s a nested servlet resolution, which takes 14 respective 21 microseconds.  Till now that’s mostly standard and hard to influence performance-wise. But it already gives you a lot information, especially regarding the resourcetype which is managing the complete response processing.

   4948 LOG Applying Requestfilters
   4952 LOG Calling filter: com.adobe.granite.resourceresolverhelper.impl.ResourceResolverHelperImpl
   4958 LOG Calling filter: org.apache.sling.security.impl.ContentDispositionFilter
   4961 LOG Calling filter: com.adobe.granite.csrf.impl.CSRFFilter
   4966 LOG Calling filter: org.apache.sling.i18n.impl.I18NFilter
   4970 LOG Calling filter: com.adobe.granite.httpcache.impl.InnerCacheFilter
   4979 LOG Calling filter: org.apache.sling.rewriter.impl.RewriterFilter
   4982 LOG Calling filter: com.adobe.cq.history.impl.HistoryRequestFilter
   7870 LOG Calling filter: com.day.cq.wcm.core.impl.WCMRequestFilter
   7908 LOG Calling filter: com.adobe.cq.wcm.core.components.internal.servlets.CoreFormHandlingServlet
   7912 LOG Calling filter: com.adobe.granite.optout.impl.OptOutFilter
   7921 LOG Calling filter: com.day.cq.wcm.foundation.forms.impl.FormsHandlingServlet
   7932 LOG Calling filter: com.day.cq.dam.core.impl.servlet.DisableLegacyServletFilter
   7935 LOG Calling filter: org.apache.sling.engine.impl.debug.RequestProgressTrackerLogFilter
   7938 LOG Calling filter: com.day.cq.wcm.mobile.core.impl.redirect.RedirectFilter
   7940 LOG Calling filter: com.day.cq.wcm.core.impl.AuthoringUIModeServiceImpl
   8185 LOG Calling filter: com.adobe.granite.rest.assets.impl.AssetContentDispositionFilter
   8201 LOG Calling filter: com.adobe.granite.requests.logging.impl.RequestLoggerImpl
   8212 LOG Calling filter: com.adobe.granite.rest.impl.servlet.ApiResourceFilter
   8302 LOG Calling filter: com.day.cq.dam.core.impl.servlet.ActivityRecordHandler
   8321 LOG Calling filter: com.day.cq.wcm.core.impl.warp.TimeWarpFilter
   8328 LOG Calling filter: com.day.cq.dam.core.impl.assetlinkshare.AdhocAssetShareAuthHandler

These are all request-level filters, which are executed just once per request.

And now the interesting part starts: the rendering of the page itself. The building blocks are called “components” (that term is probably familiar to you) and it always follows the same pattern:

  • Calling Component Filters
  • Executing the Component
  • Return from the Component Filters (in reverse order of the calling)

This pattern can be clearly seen in the output, but most often it is more complicated because many components include other components, and so you end up in a tree of components being rendered.

As an example for the straight forward case we can take the “head” component of the page:

  25849 LOG Including resource MergedResource [path=/mnt/overlay/granite/ui/content/globalhead/experiencelog, resources=[/libs/granite/ui/content/globalhead/experiencelog]] (SlingRequestPathInfo: path='/mnt/overlay/granite/ui/content/globalhead/experiencelog', selectorString='null', extension='html', suffix='null')
  25892 TIMER_START{resolveServlet(/mnt/overlay/granite/ui/content/globalhead/experiencelog)}
  25934 TIMER_END{40,resolveServlet(/mnt/overlay/granite/ui/content/globalhead/experiencelog)} Using servlet BundledScriptServlet (/libs/cq/experiencelog/components/head/head.jsp)
  25939 LOG Applying Includefilters
  25943 LOG Calling filter: com.adobe.granite.csrf.impl.CSRFFilter
  25951 LOG Calling filter: com.day.cq.personalization.impl.TargetComponentFilter
  25955 LOG Calling filter: com.day.cq.wcm.core.impl.page.PageLockFilter
  25959 LOG Calling filter: com.day.cq.wcm.core.impl.WCMComponentFilter
  26885 LOG Calling filter: com.day.cq.wcm.core.impl.monitoring.PageComponentRequestFilter
  26893 LOG Calling filter: com.adobe.granite.metrics.knownerrors.impl.ErrorLoggingComponentFilter
  26896 LOG Calling filter: com.day.cq.wcm.core.impl.WCMDebugFilter
  26899 LOG Calling filter: com.day.cq.wcm.core.impl.WCMDeveloperModeFilter
  28125 TIMER_START{BundledScriptServlet (/libs/cq/experiencelog/components/head/head.jsp)#1}
  46702 TIMER_END{18576,BundledScriptServlet (/libs/cq/experiencelog/components/head/head.jsp)#1}
  46734 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMDeveloperModeFilter, inner=18624, total=19806, outer=1182
  46742 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMDebugFilter, inner=19806, total=19810, outer=4
  46749 LOG Filter timing: filter=com.adobe.granite.metrics.knownerrors.impl.ErrorLoggingComponentFilter, inner=19810, total=19816, outer=6
  46756 LOG Filter timing: filter=com.day.cq.wcm.core.impl.monitoring.PageComponentRequestFilter, inner=19816, total=19830, outer=14
  46761 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMComponentFilter, inner=19830, total=20750, outer=920
  46767 LOG Filter timing: filter=com.day.cq.wcm.core.impl.page.PageLockFilter, inner=20750, total=20754, outer=4
  46772 LOG Filter timing: filter=com.day.cq.personalization.impl.TargetComponentFilter, inner=20754, total=20758, outer=4

At the top you see the LOG statement “Including resource …” which provides you with the information what resource is rendered, including additional information like selector, extension and suffix.

As next statement we have the resolution of the renderscript which is used to render this resource, plus the time it took (40 microseconds).

Then we have the invocation of all component filters, the execution of the render script itself, which is using a TIMER to record start time, end time and duration (18576 microseconds), and the unwinding of the component filters.

If you use a recent version of the SDK for AEM as a Cloud Service, all timestamps are in microseconds, but in AEM 6.5 and older the duration measured for the Filters (inner=…, outer=…) were printed in miliseconds (which is an inconsistency I just fixed recently).

If a component includes another component, it looks like this:

8350 LOG Applying Componentfilters
   8358 LOG Calling filter: com.day.cq.personalization.impl.TargetComponentFilter
   8361 LOG Calling filter: com.day.cq.wcm.core.impl.page.PageLockFilter
   8365 LOG Calling filter: com.day.cq.wcm.core.impl.WCMComponentFilter
   8697 LOG Calling filter: com.day.cq.wcm.core.impl.monitoring.PageComponentRequestFilter
   8703 LOG Calling filter: com.adobe.granite.metrics.knownerrors.impl.ErrorLoggingComponentFilter
   8733 LOG Calling filter: com.day.cq.wcm.core.impl.WCMDebugFilter
   8750 TIMER_START{BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)#0}
  25849 LOG Including resource MergedResource [path=/mnt/overlay/granite/ui/content/globalhead/experiencelog, resources=[/libs/granite/ui/content/globalhead/experiencelog]] (SlingRequestPathInfo: path='/mnt/overlay/granite/ui/content/globalhead/experiencelog', selectorString='null', extension='html', suffix='null')
  25892 TIMER_START{resolveServlet(/mnt/overlay/granite/ui/content/globalhead/experiencelog)}
  25934 TIMER_END{40,resolveServlet(/mnt/overlay/granite/ui/content/globalhead/experiencelog)} Using servlet BundledScriptServlet (/libs/cq/experiencelog/components/head/head.jsp)
  25939 LOG Applying Includefilters
[...]
148489 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMDeveloperModeFilter, inner=1698, total=1712, outer=14
 148500 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMDebugFilter, inner=1712, total=1717, outer=5
 148509 LOG Filter timing: filter=com.adobe.granite.metrics.knownerrors.impl.ErrorLoggingComponentFilter, inner=1717, total=1722, outer=5
 148519 LOG Filter timing: filter=com.day.cq.wcm.core.impl.monitoring.PageComponentRequestFilter, inner=1722, total=1735, outer=13
 148527 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMComponentFilter, inner=1735, total=2144, outer=409
 148534 LOG Filter timing: filter=com.day.cq.wcm.core.impl.page.PageLockFilter, inner=2144, total=2150, outer=6
 148543 LOG Filter timing: filter=com.day.cq.personalization.impl.TargetComponentFilter, inner=2150, total=2154, outer=4
 148832 TIMER_END{140080,BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)#0}

You see the component filters, but then after the TIMER_START for the page.jsp (check the trailing timer number: #0, every timer has a unique ID!) line you see the inclusion of a new resource. For this again the render script is resolved and instead of the ComponentFilters the IncludeFilters are called, but in the majority of cases the list of filters are identical. And depending on the resource structure and the script, the rendering tree can get really deep. But eventually you can see that the the rendering of the page.jsp is completed; you can easily find it by looking for the respective timer ID.

Equipped with this knowledge you can now easily dig into the page rendering process and see which resources and resource types are part of the rendering process of a page. And if you are interested in the bottlenecks of the page rendering process you can check the TIMER_END lines which both include the rendering script plus the time in microseconds it took to render it (be aware, that this time also includes it too to render all scripts invoked from this render script).

But the really cool part is that this is extensible. Via the RequestProgressTracker you can easily write your own LOG statements, start timers etc. So if you want to debug requests to better understand the timing, you can easily use something like this:

slingRequest.getRequestProgressTracker().log("Checkpoint A");

And then you can find this log message in this screen when this component is rendered. You can use it to output useful (debugging) information or just have use its timestamp to identify performance problems. This can be superior to normal logging (to a logfile), becaus you can leave these statements in production code, and they won’t pollute the log files. You just need to have access to the OSGI webconsole, search for the request you are interested and check the rendering process.

And if you are interested, you can can also get all entries in this screen and do whatever you like. For example you can write a (request-level) filter, which calls first the next filter, and afterwards logs all entries of the RequestProgressTracker to the logfile, if the request processing took more than 1 second.

The Request Progress Tracker plus the “Recent Requests” Screen of the OSGI webconsole are a really cool combination to both help you to understand the inner working of the Sling Request Processing, and it’s also a huge help to analyze and understand the performance of request processing.

I hope that this technical deep dive into the sling page rendering process was helpful for you, and you are able to spot many interesting aspects of an AEM system just be using this tool. If you have questions, please leave me a comment below.

CRX DE driven development

A recurring problem I see in AEM project implementations is the problem of missing abstraction. A lot of code deals passes around resources, ValueMaps and even Strings (paths). And because we are supposed to build software the proper way, the called method checks (or more often: not checks) that the provided resource parameter is not null, and that the resource is of the correct type.

But instead of dealing with resources, the class names and comments suggest that the code actually dealing with products. Or website structures. Or assets. But instead of using a “product” classes (or website class, or the provided asset class) still resources are used. The abstraction is missing!

For me the root cause of this problem is the CRXDE Lite. Exactly that thing which you can open on your local AEM instance at /crx/de/. Because it shows you a very nice hierarchical view to the repository, it shows you paths, and properties. And if a developer starts to build a mental model of something, this tool comes in quite handy. Because you can reach everything via path, which is a String! So instead of expressing relations between concepts I see often this:

String path = …
String pathResource = resourceResolver.getResource(path);

And because we know it’s an existing, and we want to determine the parent resource, I see

String path = …
int lastSlash = path.lastIndexOf("/");
String parentPath = path.substring(0,lastSlash);
Resource parentResource = resourceResolver.getResource(parentPath);

Which is hilarious, because

pathResource.getParent();

is much easier to use (and did you spot the off-by-one bug in the String operation example? And what does happen if the path ends already with a slash?). But that still leaves the question, why you need to get the parent resource. Maybe a

ProductCategory category = myProduct.getCategory();

is a more expressive way to describe the same. I would definitely prefer it.

So CRXDE is your biggest enemy when designing your application. If you are a seasoned AEM developer, my recommendation to you: Don’t explain your application with CRXDE. Rather use proper abstractions. Don’t do CRXDE driven development!

If that topic sounds familiar to you: I did a talk on the AdaptTo() conference 2020 regarding this topic, you can find the recording here. There I explain the problem in more detail, also including some better examples 🙂

Long running sessions and clustering

In the last blog post I briefly talked about the basics what to consider when you are writing cluster-aware code. The essence is to be aware of your write activities, and make sure that the scheduled activities are running only on a single cluster node and not on many or all of them.

Today’s focus is on the behavior of JCR sessions with respect to clustering. From a conceptual point of view there is hardly a difference to a single-node cluster (or standalone instance), but the presence of more cluster nodes add a new angle of potential problems to it.

When I talk about JCR, I am thinking of the Apache Oak implementation, which is implemented on top of the MVCC pattern. (The previous Jackrabbit implementation is using a different approach, so this whole blog post does not apply there.) The basic principle of MVCC is that each session is clearly separated from any other session which is open in parallel. Also any changes performed on a session is not visible to other sessions unless

  • the other session is invoking session.refresh() or
  • the other session is opened after the mentioned session is closed.

This behavior applies to all sessions of a JCR repository, no matter if the are opened on the same cluster node or not. The following diagram visualizes this

Diagram showing how 2 sessions are performing changes to the repository whithout seeing the changes of the other as long as they don’t use session.refresh()

We have 2 sessions A1 and B1 which are initiated at the same time t0, and which perform changes independently of each other on the repository, so session B1 cannot see the changes performed with A1_1 (and vice versa). At time t1 session A1 is refreshed, and now it can see the changes B1_1 and B1_2. And afterwards B1 is refreshed as well, and can now see the changes A1_1 and A1_2 as well.

But if a session is not refreshed (or closed and a new session is used), it will never see the changes which happened on the repository after the session has been opened.

As said before, these sessions do not need to run on 2 separate cluster nodes, you get the same behavior on a single cluster node as well. But I mentioned, that multiple cluster nodes are a special problem here. Why is that case?

That problem are OSGI services in the background, which perform a certain job and write data to the JCR repository. In a single-node cluster this not a problem, because all of these activities go through that single service; and if that service uses a long-running JCR session for it, that will never be a problem. Because this service is responsible for all changes, and the service can read and write all the relevant data. In a cluster with more than 2 nodes, each cluster node might have that service running, and the invocations of the services might be random. And as in the diagram above, on cluster node A the data A1_1 is written. And on cluster node 2 the data point B1_1 is written. But they don’t see each other’s changes if they don’t refresh the session! And in most applications, which are written for single-node AEM instances, session.refresh() is barely used, because in such situations there’s simply no need for it, as this problem never occurred.

So when you are migrating your application to AEM as a Cloud Service, review your applications and make sure that you find all long-running ResourceResolvers and JCR sessions. The best option is then to remove these long-running sessions and replace them with short-living ones, which are closed if the job is done. The second-best option is to introduce a session.refresh(), so the session sees any updates which happend to the repository in the meanwhile. (And btw: if you registering an ObservationListener in that session, you don’t need a manual refresh, as this refresh is done by the ObservationListener method anyway; what would it be for if not for reporting changes to the repository, which happen after opening the session?)

That’s all right now regarding cluster-aware coding. But I am sure that there is more to come 🙂

Cluster aware coding in AEM

With AEM as a Cloud Service quite a number of small things have changed; and next to others you also get real clustering support in the authoring environment. Which is nice, because it gives you downtime-less authoring during deployments.

But this cluster also comes with a few gotchas, and one of them is that your application code needs to be cluster-aware. But what does that mean? What consequences does it have and what code do you have to change if you have never paid attention to this aspect?

The most important aspect is to do “every change only once“. It doesn’t make sense that 2 cluster nodes are importing the same set of data. A special version of this aspect is “avoid concurrent writes to the same node“, which can happen when a scheduled job is kicked off at the same time on all nodes, and this job is trying to change something in the repository. In this case you don’t only have overhead, but very likely a lot of exceptions.

And there is a similar aspect, which you should pay attention to: connections to external systems. If you have a cluster, running the same code and configs, it’s not always wanted that each cluster node reaches out to that external system. Maybe you need to the update it with the latest content only once, because it triggers some expensive processing on their side, and you don’t want to have that triggered two or three times, probably pretty much at the same time.

I have mentioned you 2 cases where a clustered application can be behave differently than a single-node environment, now let me show you how you can make your application cluster-aware.

Scheduled jobs

Scheduled jobs are a classic tool to execute certain jobs at a certain time. Of course we could use the Sling Scheduler directly, but to make the execution more robust, you should wrap it into a Scheduled Sling Job.

See the Sling Jobs website for the documentation and some example (although the Javadocs are missing the ScheduleBuilder class, but here’s the code). And of course you should check out Kaushal Mall’s post with even more examples.

Jobs give you the guarantee, that this job is going to be executed only at least once.

Use the Sling Scheduler only for very frequent jobs (e.g. once every 5 minutes), where it doesn’t matter if one execution is skipped, e.g. because the instance was just restarting. To limit the execution of such a job to a single node, you can annotate the job runner with this annotation:

@Property (name="scheduler.runOn", value="SINGLE")

(see the docs)

What about caches?

In-memory caches are often used to speed up operations. Most often they contain the results of previous operations which are then reused; cache elements are either actively purged or expire using a time-to-live.

Normally such caches are not affected by clustering. They might contain different items with potentially different values in the cluster nodes, but that must never be a problem. If that is a problem, you have to look for a different approach, e.g. persisting the data to the repository (if they are not already coming from there) or externalizing the cache (e.g to a redis or memcached instance).

Also, having a simpler application instead of the highest-cache-hit ration possible is often a good trade-off.

Ok, these were the topics I wanted to discuss here. But expect a blog post about one of my favorite topics: “Long running sessions and clustering“.

Writing unittests for AEM (part 4): OSGI services mock services

In the last parts of this small series (part 1, part 2, part 3) I covered some basic approaches how you can use the Sling and AEM mocking libraries to ease writing unittests. The examples were quite basic and focussed, but in reality many test cases turn out to be much more complex.

And especially when your code has dependencies to other OSGI services, tests can get tricky. So today I want to walk you through some unittest I wrote some time ago, it’s a unittest for the EnsureOakIndex functionality (EnsureOakIndexJobHandlerTest).

The interesting part is that the required EnsureOakIndex service references 4 other services in total; if they are not present, my EnsureOakIndex service will never start properly. Thus you have to fullfill all service requirements of an OSGI service in the unittest as well (at least if you want to use SlingContext like I do here).

The easiest way to solve this is to rely on predefined services which are part of the SlingMocks or AemMocks. The second best way is to create simple mocks and register them a service, so the dependency is fulfilled. That’s definitely a convenient way if your tests do not invoke any of the service methods at all.

Thus the setup() method of my unittests are often pretty large, because there I prepare and inject all other services which I need to make my software-under-test work.

And because this setup works quite well and reliably, I always use AemContext for my unittests (or SlingContext, but as I haven not yet observed any difference in test execution time, I often prefer just AemContext because it comes with some more sevices). Just if I don’t need resources, nodes and no OSGI, I stick with plain junit. For everything else AemContext removes the necessity for mocking a lot.

Writing unittests for AEM (part 3): Mocking resources

I introduced SlingMocks in the recent blog posts (part 1, part 2), and till now we only covered some basics. Let’s dig deeper now and use one of its coolest features: Mocking resources.

Well, to be honest, we don’t mock resources. We do something much better: We build an in-memory structure which represent sling resources. We can even map that to an in-memory JCR-repository!

For this blog post I created a very simple Sling model, which supposed to build a classic navigation. For simplicity I omitted nearly all of the logic which makes up a production-ready navigation component, and want to add just the pages below the site root to it.

The relevant files in my github repo:

The interesting part is the unit test. This time I used the AemMocks library from wcm.io, because it provides all the magic of the SlingMocks, plus some AEM specific objects I would like to use. As the AemContext class inherits directly from the SlingContext class, we can just use it as a drop-in replacement.

Loading the test data from a JSON flle in the test resources

The setup method is very simple: It loads a JSON structure and populates a in-memory resource tree with it (adds everything below a resource “/content”). Using this approach there’s no longer the need to mock resources, properties/value maps, and mock the relation between these. This is all done by the SlingMocks/AemMocks framework.

Thus a testcase looks like this:

No boilerplate code anymore

This simple example shows how easy it can be to write unit tests. The test code is very concise and easy to read and understand. In my opinion the biggest issue when writing unit tests now is creating the test content 🙂

Some notes to the test content:

  • My personal style is to put the test content into the same structure than the java packages. This makes it very easy to find the testcontent and keeps the testcontent and the unittest together.
  • You can write the JSON manually. But you can also use CRXDE Lite to create some structure in your local AEM instance and then export it with “http://localhost:4502/content/myproject/testcontent.tidy.-1.json ” and paste it into the JSON file. I find this way much more convenient, especially if you already have most of the test content ready. But if you do that, please clean up the unnecessary properties in the test data. They are polluting the test content and make it harder to understand.

Writing unit tests for AEM (part 2): Maven Setup

In the previous post I showed you how easy it is to start using SlingContext. If you start to use this approach in your own project and just copy the code, your Maven build is likely to fail with messages like this:

[ERROR] testActivate_unconfiguredParams(de.joerghoh.cqdump.samples.unittest.ReplicationServletTest)  Time elapsed: 0.01 s  <<< ERROR!
org.apache.sling.testing.mock.osgi.NoScrMetadataException: No OSGi SCR metadata found for class de.joerghoh.cqdump.samples.unittest.ReplicationServlet at org.apache.sling.testing.mock.osgi.OsgiServiceUtil.injectServices(OsgiServiceUtil.java:381)

The problem here is not the use of the SlingMock library itself, but rather the fact that we use SlingMocks to test code which uses OSGI annotations. The fix itself is quite straight-forward: We need to create OSGI metadata also for the unittests and not only for the bundling (SlingMock is reading these metadata for the test execution).

That’s the reason why there’s the need to have a dedicated execution definition for the maven-bundle-plugin:

https://github.com/joerghoh/unittest-demos/blob/master/core/pom.xml#L35

Also you need to instruct the maven-bundle-plugin to actually export the generated metadata to the filesystem using the “<exportScr> statements; this should be the default in my opinion, but you need to specify it explicitly. Also don’t forget to add the “_dsannotation” and “_metatypeannotations” statements to its instruction section:

https://github.com/joerghoh/unittest-demos/blob/master/core/pom.xml#L43

And even then it will fail, if you don’t upgrade the maven-bundle-plugin to a version later than 4.0:

https://github.com/joerghoh/unittest-demos/blob/master/pom.xml

Ok, if you adapted your POMs in this way, your SlingMock based unittests for OSGI r6-based services should run fine. Now you can start exploring more features of SlingMocks.

Writing unit tests for AEM — using SlingMocks

Over the course of the last years the tooling for AEM development improved a lot. While some years ago there was hardly an IDE integration available, today we have dedicated tools for Eclipse and IntelliJ (https://helpx.adobe.com/experience-manager/6-4/sites/developing/using/aem-eclipse.html). Also packaging and validation support was poor, today we have the opensourced filevault and tools like oakpal (thanks Marc!)

But with SlingMocks we also have much better unittest tooling (thanks a lot to Stefan Seifert and the Sling people), so we much more and better/easier tooling than just Mockito and Powermock. SlingContext (and its extension AemContext) allows you create unit tests quite easily. Using them can help you get rid of mocking Sling Resources, repo access and many things more.

On top of that, SlingMock can easily work with the new OSGI r6 annotations, which allow you to define OSGI properties in Pojos. Mocking the @ObjectClassDefinition classes isn’t that easy, because they are essentially annotations …

To illustrate that, I have created a minimal demo with a servlet and testcases for it (source code at Github). Basically the functionality of the class itself is not relevant for this article, but we want to focus on aspects how you utilize the frameworks best to avoid boilerplate code.

The code is quite simple, but to make unittesting a bit more challenging, it uses the new OSGI r6 annotations for OSGI configuration (using the @Designate and the @ObjectClassDefinition annotations) plus a referenced service.

https://github.com/joerghoh/unittest-demos/blob/master/core/src/main/java/de/joerghoh/cqdump/samples/unittest/ReplicationServlet.java#L29

If you try to mock the ReplicationServlet.Config class the naive way, you will find out, that it’s an annotation, which is referenced in the activate() method. I always failed to mock it somehow, so I switched gear and started to use SlingMock for it. (I don’t want to say that it is not possible, but it’s definitly not straight-foward, and in my opinion writing unit-tests should be straight forward, otherwise they are not created at all.)

With SlingMocks the approach changes. I am not required to create mocks, but SlingMocks provides a mocked OSGI runtime we can use. That means, that we create the OSGI parameters as a map to tell the SlingContext object to register our service with these parameters (line 59).

https://github.com/joerghoh/unittest-demos/blob/master/core/src/main/java/de/joerghoh/cqdump/samples/unittest/ReplicationServlet.java

Because SlingContext implements quite a bit of the OSGI semantics, it also requires that all referenced services are available (if these are static references). Therefor I use Mockito to mock the Replicator and I register the mock to provide the Replicator service. In realworld I could verify the interactions of my servlet with that mock.

This basic example illustrates how you can use SlingMocks to avoid a lot of mocking and stubbing. This example does not utilize the full power of SlingMocks yet, we are just scratching at the surface. But we already have some benefit : If you switch from SCR annotations to OSGI annotations, your SlingMock unittests don’t need any change, because it provides an OSGI-like environment, and there the way how metatypes are generated and injected are abstracted away.

From SCR annotations to OSGI annotations

Since the beginning of AEM development we used annotations to declare OSGI services; @Component, @Service, @Property and @Reference should be known to everyone how has ever developed backend stuff for AEM. The implementation behind these annotations came from the Apache Felix project, and they were called the SCR annotations (SCR = Service Component Runtime). But unlike the Service Component Runtime, which is part of the OSGI standard for quite some, these annotations were not standardized. This changed with OSGI Release 6.

With this release annotations were also standardized, but they are 100% compatible to the SCR annotations. And there are a lot of resources out there, which can help to explain the differences:

I recently worked on migrating a lot of the code from ACS AEM Commons from SCR annotations to OSGI annotations, and I want to share some learning I gained on the way. Because in some subtle areas the conversion isn’t that easy.

Mixed use of SCR annotations and OSGI annotations

You can mix SCR annotations and OSGI annotations in a project, you don’t need to migrate them all at once. But you can to be consistent on a class level, you cannot mix SCR and OSGI annotations in a single class. This is achieved by an extension to the maven-bundle-plugin (see below).

Migrating properties

SCR property annotations give you a lot of freedom. You can annotate them on top of the class (using the @Properties annotation as container with nested @Property annotations), you can annotate individual constant values to be properties. You can make them visible in the OSGI webconsole (technically you are creating a metatype for them), or you can mark them as private (no metatype is created).

With OSGI annotations this is different.

  • Metatype properties are handled in the dedicated configuration class marked with @ObjectClassDefinition. They cannot be private.
  • Properties which are considered to be private are attached to the @Component annotation. They cannot be changed anymore.

A limitation from a backward compatibility point of view: With SCR annotations you are not limited in the naming of properties, next to characters often the “.” (dot) and the “-” (dash, minus) was used. With OSGI r6 annotations you can easily create a property with a “.” in it

String before_after() default "something";

will result in the property with the name “before.after”; but with OSGI r6 annotations you cannot create properties with a “-” in it. Only OSGI r7 (which is supported in AEM 6.4 onwards) supports it with a construct like this:

String before$_$after() default "something";

If you want to keep compatibility with AEM 6.3, expect the breakage of property names or you need to investigate in workarounds (see #1631 of ACS AEM Commons). But my recommendation is to avoid the use of the “-” in property names alltogether and harmonize this in your project.

Update: I posted an additional blog post specifically on migrating SCR properties, mostly in the context of OSGI DS and OSGI Metatypes.

Labels & description

All the metatype stuff (that means, how OSGI configurations appear in the /system/console/configMgr view) is handled on the level of the @ObjectClassDefinition annotation and the method annotated with it. With the SCR annotations this was all mixed up between the @Component annotation and the @Property fields.

Update the tooling to make it work

If you want to work with OSGI annotations, you should update some elements in your POM as well:

  • Update the maven-bundle-plugin to 4.1.0
  • Remove the dependency to the maven-scr-plugin
  • Add a dependency to org.osgi:org.osgi.annotations:6.0.0 to your POM.
  • Then you need to add an additional execution to your maven-bundle-plugin (it’s called “generate-scr-metadata-for-unittests“) and update its configuration (see it on ACS AEM Commons POM).

The interesting part is here is the plugin to the maven-bundle-plugin, which can also handle SCR annotations; this statement allows you to mix both types of annotations.

This blog post should have given you some hints how you migrate the SCR annotations of an existing codebase to OSGI annotations. It’s definitly not a hard task, but some details can be tricky. Therefor it’s cool if you have the chance to mix both types of annotations, so you don’t need a big-bang migration for this.

Detecting JCR session leaks

A problem I encounter every now and then are leaking JCR sessions; that means that JCR sessions are opened, but never closed, but just abandoned. Like Files, JCR sessions need to be closed, otherwise their memory is not freed and they cannot be garbage collected by the JVM. Depending on the number of sessions you leave in that state this can lead to serious memory problems, ultimately leading to a crash of the JVM because of an OutOfMemory situation.

(And just to be on the safe side: In AEM ootb all ResourceResolvers use a JCR session internally; that means whatever I just said about JCR sessions applies the same way to Sling ResourceResolvers.)

I dealt with this topic already a few times (and always recommended to close the JCR sessions), but today I want to focus how you can easily find out if you are affected by this problem.

We use the fact that for every open session an mbean is registered. Whenever you see such a statement in your log:

14.08.2018 00:00:05.107 *INFO* [oak-repository-executor-1] com.adobe.granite.repository Service [80622, [org.apache.jackrabbit.oak.api.jmx.SessionMBean]] ServiceEvent REGISTERED

That’s says that an mbean service is registered for a JCR session; thus a JCR session has been opened. And of course there’s a corresponding message for unregistering:

14.08.2018 12:02:54.379 *INFO* [Apache Sling Resource Resolver Finalizer Thread] com.adobe.granite.repository Service [239851, [org.apache.jackrabbit.oak.api.jmx.SessionMBean]] ServiceEvent UNREGISTERING

So it’s very easy to find out if you don’t have a memory leak because of leaking JCR sessions: The number of log statements for registration of these mbeans must match the number of log statements for unregistration.

In many cases you probably don’t have exact matches. But that’s not a big problem if you consider:

  • On AEM startup a lot of sessions are opened and JCR observation listeners are registered to them. That means that a logfile with AEM starts and stops (and the number of starts do not match the number of stops) it’s very likely that these numbers do not match. Not a problem.
  • The registration (and also the unregistration) of these mbeans often happens in batches; if this happen during logfile rotation, you might have an imbalance, too. Again, not per se a problem.

It’s getting a problem, if the number of sessions opened is always bigger than the number of sessions closed over the course of a few days.

$ grep 'org.apache.jackrabbit.oak.api.jmx.SessionMBean' error.log | grep "ServiceEvent REGISTERED" | wc -l
212123
$ grep 'org.apache.jackrabbit.oak.api.jmx.SessionMBean' error.log | grep "ServiceEvent UNREGISTERING" | wc -l
1610
$

Here I just have the log data of a single day, and it’s very obvious, that there is a problem, as around 220k sessions are opened but never closed. On a single day!

To estimate the effect of this, we need to consider that for every of these log statements these objects are retained:

  • A JCR session (plus objects it reaches, and depending on the activities happening in this session it might also include any pending change, which will never going to be persisted)
  • A Mbean (referencing this session)

So if we assume that 1kb of memory is associated with every leaking session (and that’s probably an very optimistic assumption), this would mean that the system above would loose around 220M of heap memory every day. This system probably requires a restart every few days.

How can we find out what is causing this memory leak? Here it helps, that Oak stores the stack trace when opening sesions as part of the session object. Since around Oak 1.4 it’s only done if the number of open sessions exceeds 1000; you can tune this value with the system property “oak.sessionStats.initStackTraceThreshold”; set it to the appropriate value. This is a great help to find out where the session is opened.

And then go to /system/console/jmx, check for the “SessionStatistics” mbeans (typically quite at the bottom of the list) and select on the most recent ones (they have the openening date already in the name)

session information in the mbean view
session information in the mbean view

And then you can find in the “initStackTrace” the trace where this session has been opened:

Stacktrace of an open JCR session
Stacktrace of an open JCR session

With the information at hand where the session has been opened it should be obvious for you to find the right spot where to close the session.
If you spot a place where a session is opened in AEM product code but never closed, please check that with Adobe support. But be aware, that during system startup sessions are opened and will stay open while the system is running. That’s not a problem at all, and please do not report them!

It’s only a problem if you have a at least a few hundreds session open with the very same stack trace, that’s a good indication of such a “leaking session” problem.

A good followup reading on AEM HelpX pages with some details how you can fix it.