Best practices for AEM unittests

Some time ago I already wrote some posts (1, 2, 3) about unit testing with AEM, especially in combination with SlingMocks / AEM Mocks.

In the last months I have also spent quite some time improving the unit tests of ACS AEM Commons, mostly in the context of updating the Mockito framework from 1.9.x to a more recent version (a prerequisite to make the complete build work with Java 11). During that undertaking I reviewed a lot of unit tests which required adjustments, and I came across some patterns which I also find (often?) in AEM projects. I don't think that these patterns are necessarily wrong, but they make tests hard to understand and hard to change, and often they make the production code overly complex.

I will list a few of these patterns, which I consider problematic. I won’t go that far and call them anti-patterns, but I will definitely look closely at every instance I come across.

Unittests don’t matter, only test coverage matters.
Sometimes I get the impression that the quality of the tests doesn't matter, only the resulting test coverage (as indicated by coverage tools like JaCoCo); that paying attention to the code quality of the tests and investing time into refactoring them is wasted time. I beg to differ.
Although unit tests are not deployed into a production environment, the usual quality measures should be applied to them as well, because that makes them easier to extend and to understand. The worst thing that can happen to production code is that a bugfix is not developed in a TDD way (build a failing test case first to prove that the error actually happens) because it is too much work to extend the existing tests.

Mocking Sling Resources and/or JCR nodes
With AEM Mocks available there should not be any need to manually mock Sling Resources and JCR nodes. It's a lot of work to do that, especially compared to loading a JSON structure into an in-memory repository. The same holds for ResourceResolvers and JCR sessions. So don't mock Sling resources and JCR nodes! That's a case for AEM Mocks!
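
To give an idea how little code this needs, here is a minimal sketch of such a test; the fixture file name, content paths and property values are made-up examples:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;

import org.apache.sling.api.resource.Resource;
import org.junit.Rule;
import org.junit.Test;

import io.wcm.testing.mock.aem.junit.AemContext;

public class SettingsModelTest {

  @Rule
  public final AemContext context = new AemContext();

  @Test
  public void readsPropertiesFromLoadedContent() {
    // load a JSON fixture from the test resources into the in-memory repository
    context.load().json("/test-content.json", "/content/mysite");

    Resource pageContent = context.resourceResolver().getResource("/content/mysite/en/jcr:content");
    assertNotNull(pageContent);
    assertEquals("English Homepage", pageContent.getValueMap().get("jcr:title", String.class));
  }
}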

Using setters in services to set references
When you want to test services, the AEM Mocks framework handles injections as well; you just need to instantiate your service using its default constructor and then pass it to the context.registerInjectActivate() method. If required, create the referenced services as mocks beforehand and register them as well. AEM Mocks lets you test OSGi services and components in a very natural way (including activation and injection of references), so please use it.
There is no need to add setter methods for the service references to the production code just for this use case!
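
A minimal sketch of this approach could look as follows; MyImportService, the referenced ConfigurationProvider and the "importActive" property are made up for the example:

import static org.junit.Assert.assertNotNull;

import org.junit.Rule;
import org.junit.Test;
import org.mockito.Mockito;

import io.wcm.testing.mock.aem.junit.AemContext;

public class MyImportServiceTest {

  @Rule
  public final AemContext context = new AemContext();

  @Test
  public void activatesServiceWithInjectedReference() {
    // register the referenced service as a mock first, so it can be injected
    ConfigurationProvider configurationProvider = Mockito.mock(ConfigurationProvider.class);
    context.registerService(ConfigurationProvider.class, configurationProvider);

    // instantiate via the default constructor; the mock context injects references and calls activate()
    MyImportService underTest = context.registerInjectActivate(new MyImportService(),
        "importActive", true);

    assertNotNull(underTest);
  }
}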

If you are looking for an example of how these suggestions can be implemented, you can have a look at the example project I wrote last year.

Of course this list is far from complete; if you have suggestions or more (anti-)patterns for unit tests in the AEM area, please leave me a comment.

AEM coding pattern: Run-mode specific code

It is very common to have code which is not supposed to run on all environments, but only on some. For example you might have code which is supposed to import data only on the authoring instance.

Then code often looks like this:

if (isAuthoring()) {
  // import data
}

boolean isAuthoring() {
  return slingSettingsService.getRunModes().contains("author");
}

This code does what it's supposed to do. But there can be a problem: you might want to run this code not only on authors, but (for whatever reason) also on publish. Or you don't want to run it on UAT authors.
In such cases this code does not work anymore, because it's not flexible enough (the runmode is hardcoded); any change requires a code change and a deployment.

A better way is to reformulate the requirement a bit: "The data import will only run if the 'importActive' flag is set to true".

If you design this "importActive" flag as an OSGi configuration property and combine it with runmode-dependent configurations, you can achieve the same behaviour as above, but in a much more flexible way. You can even disable the import (if only for a certain time).

The code could then look like this:

@Property(boolValue = true)
private static final String IMPORT_ACTIVE_PROP = "importActive";

private boolean importActive;

@Activate
protected void activate(ComponentContext ctx) {
  importActive = PropertiesUtil.toBoolean(ctx.getProperties().get(IMPORT_ACTIVE_PROP), true);
}

...

if (importActive) {
  // import data
}

Now the requirement "imports should only happen on authoring" is translated into a configuration decision and no longer hardcoded. And that's the reason why I will be picky about this in code reviews.

Do not write to the repository in event handling

Repository writes are resource-intensive operations which always come with a cost. First of all, a write operation holds a number of locks, which limits the overall concurrency of write operations. Secondly, the write itself can take some time, especially if the I/O is loaded or you are running in a cluster with MongoDB as backend; in this case it's the latency of the network connection plus the latency of MongoDB itself. And third, every write operation causes asynchronous index updates, triggers JCR observation and Sling resource events, etc. That's the reason why you shouldn't take write operations too lightly.

But don't be scared of writing to the repository for performance reasons, not at all! Instead, try to avoid unnecessary writes: either batch them and collapse multiple write operations into a single transaction, if the business case allows it, or avoid the repository writes altogether, especially if the information does not need to be persisted.
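
As a simple illustration, here is a minimal sketch of such batching (the target path, node type and property name are made up): all changes are collected in the session first and persisted with a single save().

import java.util.Map;

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class BatchImporter {

  // write all items within one session and persist them with a single save(),
  // instead of calling session.save() once per item
  public void importBatch(Session session, Map<String, String> items) throws RepositoryException {
    Node parent = session.getNode("/content/myproject/import"); // hypothetical target path
    for (Map.Entry<String, String> item : items.entrySet()) {
      Node node = parent.addNode(item.getKey(), "nt:unstructured");
      node.setProperty("payload", item.getValue());
    }
    session.save(); // one commit for the whole batch
  }
}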

Recently I came across a pattern of write operations which puts a lot of stress on the system: write amplification. A Sling resource event listener was listening for certain write events to the repository, and when one of these was received (which happened quite often), a Sling job was created to handle this event. And the job implementation just did another small change to the repository (a write!) and finished.

In that case a single write operation resulted in:

  • A write operation to persist the Sling Job
  • A write operation performed by the Job implementation
  • A write operation to remove the Sling Job

Each of these "regular" write operations caused 3 subsequent writes to the repository, which is a great way to kill your write performance completely. Luckily none of these 3 additional write operations caused the event listener to create yet another Sling job … That would have had the same effect as "out of office" notifications in the early days of Microsoft Exchange (which didn't detect them and sent an "out-of-office" reply to the "out-of-office" sender): a very effective way of DoSing yourself!

(Image: a flood of writes and the remains of performance)

But even without that, it resulted in a very loaded environment and reduced the subjective performance a lot; thread dumps indicated massive lock contention on write operations. When these 3 additional writes were optimized (effectively removed, as collecting the information in memory and batch-writing it after 30 seconds was possible), the situation improved a lot and the lock contention was gone.

The lesson you should take away from this scenario: avoid writing to the repository in JCR observation or Sling event listeners! Collect the data and write it outside of these handlers in a separate thread.
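
A minimal sketch of this idea could look like the following; the paths, the 30-second interval and the "processed" property are made-up examples, and OSGi wiring, error handling and a clean shutdown of the executor are omitted. The listener only collects paths, and a scheduled task flushes them with a single, dedicated session:

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;

public class CollectingListener implements EventListener {

  private final Queue<String> pendingPaths = new ConcurrentLinkedQueue<>();
  private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
  private final Session writeSession; // used only by the flusher thread, sessions are not thread-safe

  public CollectingListener(Session writeSession) {
    this.writeSession = writeSession;
    // flush the collected paths every 30 seconds in a separate thread
    flusher.scheduleAtFixedRate(this::flush, 30, 30, TimeUnit.SECONDS);
  }

  @Override
  public void onEvent(EventIterator events) {
    // only collect, never write to the repository inside the event handler
    while (events.hasNext()) {
      try {
        pendingPaths.add(events.nextEvent().getPath());
      } catch (RepositoryException e) {
        // log and ignore
      }
    }
  }

  private void flush() {
    List<String> batch = new ArrayList<>();
    String path;
    while ((path = pendingPaths.poll()) != null) {
      batch.add(path);
    }
    if (batch.isEmpty()) {
      return;
    }
    try {
      // mark all collected resources in one go and persist with a single save()
      for (String p : batch) {
        writeSession.getNode(p).setProperty("processed", true);
      }
      writeSession.save();
    } catch (RepositoryException e) {
      // log and decide whether to re-queue the batch
    }
  }
}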

PS: An interesting side effect of Sling event listeners taking a long time is that these handlers are blacklisted once they take more than 5 seconds to process an event (e.g. because they are blocked on writing). Then they are not fired again (until you restart AEM), unless you explicitly whitelist them or turn off this feature completely.

Design pattern: Configuration of OSGI services

When you are an AEM backend developer, this pattern is very familiar: whenever you need to provide configuration data to a service, you collect this data in the activate() method (by good tradition that's the name of the method annotated with the @Activate annotation). I use this pattern often and normally it does not cause any problems.

But recently in my current project we ran into an issue which caused headaches. We needed to provide an API key which is supposed to change every now and then, and which is therefore not configured as an OSGi property but stored inside the repository, so it can be authored.

We deployed the code, entered the API key, and … guess what? It was not working at all. The API key was read in the activate() method, but at that time the key was not yet present. The only way to make it work was to restart the service/bundle/instance. And besides the initial provisioning, it would have required a restart every time the key was changed.

That's not a nice situation when you try to automate your deployment (or try not to break your automated deployment). We had to rewrite our logic so that the API key is read periodically (every minute) from the repository. Of course the optimal way would have been to use JCR observation or a Sling event handler to detect any change of the API key node/resource immediately …

So whenever you have such "dynamic" configuration data, you should design your code in a way that it can cope with situations where this configuration is not there (yet) or changes over time. The last thing you want to do is restart your instance(s) because such a configuration change has happened.

Let's formulate this as a pattern: do not read from the repository in the activate() method of a service! The content you read might change at runtime, and you need to react to that.
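
A minimal sketch of such an approach could look like this; the content path, the property name and the one-minute interval are made-up examples, and the OSGi wiring is omitted. The key is read lazily and re-read at most once per minute instead of once in activate():

import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;

public class ApiKeyProvider {

  private static final long MAX_AGE_MS = 60 * 1000;
  private static final String CONFIG_PATH = "/content/myproject/config/jcr:content";

  private final ResourceResolverFactory resolverFactory;
  private volatile String cachedKey;
  private volatile long lastRead;

  public ApiKeyProvider(ResourceResolverFactory resolverFactory) {
    this.resolverFactory = resolverFactory;
  }

  public String getApiKey() throws LoginException {
    long now = System.currentTimeMillis();
    if (cachedKey == null || now - lastRead > MAX_AGE_MS) {
      ResourceResolver resolver = resolverFactory.getServiceResourceResolver(null);
      try {
        Resource config = resolver.getResource(CONFIG_PATH);
        if (config != null) {
          cachedKey = config.getValueMap().get("apiKey", String.class);
        }
        lastRead = now;
      } finally {
        resolver.close();
      }
    }
    // may still be null if the key has not been authored yet; callers must handle this
    return cachedKey;
  }
}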

AEM coding best practice: No String operations on paths!

I recently needed to review project code in order to validate whether it causes problems when upgrading from AEM 5.6 to AEM 6.x; so my focus wasn't on the code itself in the first place, but on the usual suspects (JCR queries etc.). But after seeing a few dozen classes I found a pattern which I then found all over the code: the excessive use of String operations, with a special focus on String operations on repository paths.

For example something like this:

String[] segments = resource.getPath().split("/");
String settingsPath = "/" + StringUtils.join(segments, "/", 1, 3) + "/settings/jcr:content";
Resource settings = resourceResolver.getResource(settingsPath);
ValueMap vm = settings.adaptTo(ValueMap.class);
String language = vm.get("language", String.class);

(to read settings, which are stored in a dedicated page per site).

Typically it's a weird mixture of String methods, StringUtils calls plus some custom helpers which do things with ResourceResolvers, Sessions and paths, spread all over the codebase. Ah, and it lacks a lot of error checking (what if the settings page doesn't contain the "language" property? adaptTo() can return null).

Sadly that problem is not limited to this specific project; I have found it in many other projects as well.

Such a code structure is a clear sign of a lack of abstraction and guidance. There are no concepts available which eliminate the need to operate on strings, so the developer is left with the only abstraction he has: the repository and CRXDE Lite's view on it. He logs into the repository, looks at the structure and then decides how to mangle known pieces of information to get hold of the things he needs to access. If there is no one who reviews the code and rejects such pieces, the pattern emerges and you can find it all over the codebase. Even if developers create utility classes for this (normally every developer creates their own), it's a bad approach, because these concepts are not designed ("just read the language from the settings page!"); the design "just happens", there is no focus on it, therefore quality is not enforced and error handling is typically quite poor.

Another huge drawback of this approach: it's very hard to change the content structure, because assumptions about it are baked in at many levels and are often not directly visible. For example, the constants "1" and "3" in the above code snippet determine the path prefix, but you cannot search for such definitions (even if they are defined as constant values).

If a proper abstraction were provided, the above code could look like this:

String language = "en";
Settings settings = resource.adaptTo(Settings.class);
if (settings != null) {
  language = settings.getLanguage();
}

This approach is totally agnostic of paths and the fact that settings are stored inside a page. It hides this behind a well-defined abstraction. And if you ever need to change the content structure you know where you need to change your code: Only in the AdapterFactory.
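
A minimal sketch of such an AdapterFactory could look like the following; the Settings class, the assumed content structure below the site root and the default language are made up for this example, and the OSGi registration (with the adaptables/adapters properties) is omitted:

import org.apache.sling.api.adapter.AdapterFactory;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ValueMap;

public class SettingsAdapterFactory implements AdapterFactory {

  @Override
  @SuppressWarnings("unchecked")
  public <AdapterType> AdapterType getAdapter(Object adaptable, Class<AdapterType> type) {
    if (!(adaptable instanceof Resource) || type != Settings.class) {
      return null;
    }
    // the only place in the codebase that knows about the content structure;
    // assumption for this sketch: settings live in a "settings" page below the site root
    Resource siteRoot = findSiteRoot((Resource) adaptable);
    if (siteRoot == null) {
      return null;
    }
    Resource settingsContent = siteRoot.getChild("settings/jcr:content");
    if (settingsContent == null) {
      return null;
    }
    return (AdapterType) new Settings(settingsContent.adaptTo(ValueMap.class));
  }

  private Resource findSiteRoot(Resource resource) {
    // assumption: the site root is the direct child of /content
    Resource current = resource;
    while (current != null && current.getParent() != null
        && !"/content".equals(current.getParent().getPath())) {
      current = current.getParent();
    }
    return current;
  }

  public static class Settings {

    private final ValueMap properties;

    Settings(ValueMap properties) {
      this.properties = properties;
    }

    public String getLanguage() {
      return properties != null ? properties.get("language", "en") : "en";
    }
  }
}

With something like this registered, the resource.adaptTo(Settings.class) call from the snippet above works without any path handling in the calling code.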

So whenever you see code which uses String operations on paths: Rethink it. Twice. And then come up with some abstraction layer and refactor it. It will have a huge impact on the maintainability of your code base.

JCR Observation in clustered AEM instances

Clustering AEM became a bit different with the introduction of Oak. With the enforcement of the MVCC model in Oak I also advise you to revisit some patterns you might have got used to, because some code which worked with no apparent problem in AEM 5.x might cause problems now.

One thing I would check are the JCR observation listeners. Using JCR observation is a common way to react to changes in the repository, and it has been a common pattern since CQ 5.0. So what's the problem with that? The problem is that many JCR observation handlers are not written with clustering in mind.

Take the example that you need to react to changes in the repository and in turn modify something else. The usual approach is to have a service like this (omitting a lot of the boilerplate …):

public class MyListener implements EventListener {

 @Activate
 protected void activate() {
  ...
  ObservationManager om = session.getWorkspace().getObservationManager();
  om.addEventListener(this,
   Event.NODE_ADDED,          // event types
   "/content/mysite",         // absolute path
   true,                      // isDeep
   null,                      // uuids
   new String[]{"cq:Page"},   // node types
   true);                     // noLocal
  ...
 }

 public void onEvent(EventIterator events) {
  // iterate through the events and change something in the repository.
 }

}

This works very well in any non-clustered environment, because there is only a single event handler performing these changes. In clustered environments the situation is different, because on each cluster node such an event handler is active, and each one wants to perform the repository changes.
In that case you'll see a lot of Oak exceptions (on all cluster nodes) which indicate that nodes have been modified externally (outside of the current session) and that a merge was not possible. This is because the changes happen in (quasi-)parallel and are not visible to the currently open sessions, thus causing these exceptions.

The only solution to this problem is to execute the EventListener on a single node only, or to handle every event by exactly one event handler instead of all of them.

Handling every observation event by exactly one handler is the elegant and scalable solution. The idea is to handle on every cluster node only the changes which happen on this cluster node ("local events"). While the JCR API doesn't have any notion of a cluster and the observation API does not tell you whether an event is local or not, the Jackrabbit implementation (which Oak uses here) supports this through the JackrabbitObservationManager. As you can see in the following snippet, only the registration of the observation handler changes, but not the handler itself.

public class MyScalableListener implements EventListener {

 @Activate
 protected void activate() {
  ...
  JackrabbitEventFilter ef = new JackrabbitEventFilter()
   .setAbsPath("/content/mysite")
   .setNodeTypes(new String[]{"cq:Page"})
   .setEventTypes(Event.NODE_ADDED)
   .setIsDeep(true)
   .setNoExternal(true);
  JackrabbitObservationManager om = (JackrabbitObservationManager) session.getWorkspace().getObservationManager();
  om.addEventListener(this, ef);
  ...
 }

 public void onEvent(EventIterator events) {
  // iterate through the events and change something in the repository.
 }
}

Through this Jackrabbit API extension you can register your EventListener to handle local changes only and ignore any external ones generated on other cluster nodes (using the setNoExternal(true) call). This is a scalable solution because the events are handled at the location where they are generated, and no cluster node becomes a bottleneck because of this.

So whenever you write an observation handler, and especially when you use a cluster, you should review your code and make sure that you avoid concurrent access to the same resources. Of course there are many ways to get concurrent access even without clustering, but when you actually use clustering, the JCR observation handlers are the easiest piece of code to check and fix.

Resource path vs URL and rewriting links

Today I want to discuss an aspect of an AEM application which is rarely considered during application development, but which normally becomes very important right before a go-live: the path element of a URL and how it is constructed (either in its full version or in a shortened one).

Newcomers to the AEM world sometimes ask how the public URLs are determined and maintained; from their experience with older or other CMS systems pages have an ID and this ID has to be mapped somehow to a URL.
Within AEM this situation is different, because the author creates a page directly in the tree structure of a site. And the name of the page can be directly mapped to a URL. So if an author creates a page /content/mysite/en/news/happy-new-year-2016, this page can be reached via https://HOST/content/mysite/en/news/happy-new-year-2016.html (in the simplest form).

From a technical point of view, the resource path is mapped to the path element of a URL. In many cases this is a 1:1 mapping (which means that the full resource path is taken as the path of the URL). Often "many" means "in development environments", because in production environments these kinds of URLs are long and contain redundant information, which is something you should avoid. A URL also contains a domain part, and this domain part often carries information already, so it isn't needed in the path anymore.
So instead of "https://mysite.com/content/mysite/en/news.html" we rather prefer "https://mysite.com/en/news.html" and map only a subset of the resource path.

When mapping the resource path to a URL you must be careful, because the other way (the mapping of URL to resource path) has to work as well, and there must be exactly 1 mapping.

Such mappings (I often call the mapping "resource path to URL path" a forward mapping and "URL path to resource path" a reverse mapping) can be created using the /etc/map mechanism. In a web application you need both mappings:

  1. When a request is received, the URL path has to be mapped to a resource, so that the Sling resource processing can start.
  2. When the rendered page contains links to other pages, the resource paths of these pages have to be rendered as URL paths.

(1) is done automatically by Sling if the correct ruleset is provided. (2) is much more problematic, because all references to resources provided by AEM have to be rewritten. All references? Generally speaking yes; I will discuss this later on.

This mapping can be done through the two mapping methods of the ResourceResolver API: map() for the forward mapping (resource path to URL path) and resolve() for the reverse mapping (URL path to resource).
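
As a small illustration (the paths are made-up examples), the relation between the two methods looks like this:

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

public class MappingExample {

  public void demonstrateMapping(SlingHttpServletRequest request) {
    ResourceResolver resolver = request.getResourceResolver();

    // forward mapping: resource path -> URL path, applying the rules defined e.g. below /etc/map
    String urlPath = resolver.map(request, "/content/mysite/en/news");
    // with a suitable mapping this could return "/en/news"

    // reverse mapping: incoming URL path -> resource
    Resource resource = resolver.resolve(request, "/en/news");
  }
}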

You might wonder why you never use these two methods in your own code, even though I wrote above that all links to other pages need to be rewritten. Basically you don't have to do this yourself, because the HTML created by the rendering pipeline (including all filters) is streamed through the Sling output rewriting pipeline. This chain contains a rewriter rule which scans through the HTML and tries to apply the forward mapping to all links.

But it only runs on HTML output, and there are other elements of a site which contain references to content stored in AEM as well, for example JavaScript or CSS files. References contained in these files are not rewritten, but delivered as they are stored in the repository. In many cases the setup is designed in a way that a 1:1 mapping still works for them; but that's not always possible (or wanted).

So please take this as advice: do not hardcode paths in CSS or JavaScript files if there's a chance that these paths need to be mapped.
(Rewriting formats other than HTML is not part of AEM itself; of course you can extend the defaults and provide a rewriting capability for JavaScript and CSS as well, but that's not an easy task.)

The question is whether you really have to rewrite all resource paths at all. In many cases it is OK to just have the URLs of the HTML pages looking nice (because these are the only URLs which are displayed prominently), while all the other resources (e.g. assets, CSS and JavaScript files) don't need to be mapped at all and the default 1:1 mapping can be used for them. Then you're fine, because you only have to define the mapping once in /etc/map and that's it.

The Apache mod_rewrite module also offers very flexible ways to do reverse mapping, but it lacks a way to apply the forward mapping to the HTML pages (as the Sling output rewriter does). So mod_rewrite is a cool tool, but it is not sufficient to completely cover all aspects of resource mapping.

How can I avoid Oak write/merge conflicts?

Sandeep asked in a comment to the previous posting:

Even if your sessions are short, and you have made a call to session.refresh(true), it is possible that some one made a change before you did a session.save(), right? So, what is the best practice in dealing with such a scenario?
Keep refreshing (session.save(true)) in a loop, until your session.save() is successful or until you hit an assumed maximum number of attempts limit?
Or is there any other recommended best practice?

That's a good question, but also a question with no fully satisfying answer. Whenever you want to modify nodes in the repository, there's a chance that the same nodes are changed in parallel, even if you change only a single node or property. In reality this rarely happens: most features in AEM are built in a way that each step (or each workflow instance, each Sling job, each replication event, etc.) has its own nodes to operate upon. So concurrency should not provoke a situation where multiple concurrent operations compete for writes on a single node.
So from a coding perspective it should be possible to avoid such situations, not only because of this kind of problem, but also for performance and debugging reasons.

Something you cannot deal with in this way are author changes. If 2 authors decide to change a page at the same time, it's likely that they screw up the page; you can hardly avoid that with code alone. But if you cannot guarantee from a work-organization point of view that no 2 persons work on the same page at the same time, teach your authors to use the "lock" feature. It basically prevents other authors from making changes temporarily. But according to the Oak documentation it isn't suited for short-living locks (in a database sense), but rather for longer-living locks (an author locks a page to prevent other authors from editing it).

So, to come to a conclusion on Sandeep's question: it depends. If you have designed your application carefully, you should rarely run into situations where multiple threads compete for writes on the same node. Whenever it occurs, it should be considered a bug, analyzed and then fixed.
But there can be other cases where a retry approach makes sense. In any case I would retry a few times (e.g. 10) and then abort the operation with a meaningful log message; I don't think that it's good to retry indefinitely.
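
A minimal sketch of this "retry a few times, then give up" approach could look like this; the RepositoryOperation callback and the limit of 10 attempts are made up for the example:

import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class RetryingWriter {

  private static final int MAX_ATTEMPTS = 10;

  /** Hypothetical callback which performs the actual modifications on the session. */
  public interface RepositoryOperation {
    void apply(Session session) throws RepositoryException;
  }

  public void saveWithRetry(Session session, RepositoryOperation operation) throws RepositoryException {
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        // bring the session up to the current HEAD state, keeping our own transient changes
        session.refresh(true);
        operation.apply(session);
        session.save();
        return;
      } catch (RepositoryException e) {
        // drop the transient changes, otherwise the next save() would fail again
        session.refresh(false);
        if (attempt == MAX_ATTEMPTS) {
          throw new RepositoryException("Giving up after " + MAX_ATTEMPTS + " attempts", e);
        }
      }
    }
  }
}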

AEM anti pattern: Long running sessions

AEM 6.x comes with Apache Oak, which is built on the MVCC principle. MVCC (multi-version concurrency control) gives you a view of a certain state of the repository; this state does not change and can be considered immutable. If you want to perform a change on the repository, the change is performed against this state and then applied (merged) to the HEAD state (which is the most current state within the repository). This merge is normally not a problem if the state of the session doesn't differ too much from the HEAD state; in case the merge fails, you get an OakMerge exception.

Note that this is a change compared to Jackrabbit 2.x and CRX 2.x, where the state of a session was always up to date, and where these merge exceptions never happened. This also means that you might need to change your code to make it work well with Oak!

If you have long-running sessions, the probability of such an OakMerge exception gets higher and higher. This is due to other changes happening in the repository, which can also affect the areas where your session wants to perform its changes. This is a problem especially in cases where you run a service which opens a session in the activate() method, closes it in deactivate() and uses it in between to save data to the repository. These are rare cases (because they have been discouraged for years), but they still exist.

The problem is that if a save() operation fails due to such an OakMerge exception, the transient space of that session is polluted. The transient space of a session is heap memory where all the changes are stored which are about to be saved. A successful session.save() cleans that space afterwards, but if an exception happens this space is not cleaned. And if a session.save() fails because of such an OakMerge exception, any subsequent session.save() will fail as well.

Such an exception could look like this (relevant parts only):

Caused by: javax.jcr.InvalidItemStateException: OakState0001: Unresolved conflicts in /content/geometrixx/en/services/jcr:content
at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:237)
at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:212)
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.newRepositoryException(SessionDelegate.java:672)
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.save(SessionDelegate.java:539)
at org.apache.jackrabbit.oak.jcr.delegate.ItemDelegate.save(ItemDelegate.java:141)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl$4.perform(ItemImpl.java:262)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl$4.perform(ItemImpl.java:259)
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:294)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:113)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.save(ItemImpl.java:259)
at org.apache.jackrabbit.oak.jcr.session.NodeImpl.save(NodeImpl.java:99)

There are 2 ways to mitigate this problem:

  • Avoid long-running sessions and replace them with a number of short-living sessions. This is the way to go and in most cases the easiest solution to implement. It also avoids the problems which come with shared sessions.
  • Add code to call session.refresh(true) before you do your changes. This refreshes the session state to the HEAD state, so merge exceptions are less likely. If you still run into a RepositoryException, you should explicitly clean up your transient space using session.refresh(false); you will lose your changes, but the next session.save() will not fail because of left-over transient state. This is the path you should choose when you cannot create new sessions.

1000 nodes per folder and Oak orderable nodes

Every now and then the question comes up how many child nodes are supported in JCR. While the technically correct answer is "there is no limit", in practice there are some limitations.

In CRX 2.x nodes are effectively always ordered: even unordered node types are treated as if they were ordered, which makes the difference nearly non-existent. [Thanks Justin for making this clear!] This means that the order needs to be maintained on all operations, including adding and removing sibling nodes. The more child nodes a node has, the more time it takes to maintain this list.

So, what about this "1000 child nodes" limit? First of all, this number is arbitrary 🙂 But when you use CRXDE Lite, browsing a node with lots of child nodes gets really slow, mostly because of the time it takes the JavaScript to render it. And of course the performance of add and remove operations also degrades linearly. Besides, you hardly ever have cases where you need more than 1000 child nodes.

For reading nodes, however, there is no impact on performance. So it is not a problem to have 6000 nodes in /libs/wcm/core/i18n/en, because you only read those nodes and don't change them.

Nevertheless this "limit" can be cumbersome, especially if you don't need the feature of ordered child nodes. And the fact that there is such a limit also means that you already pay the price (to a smaller extent) with fewer nodes.

With Apache Oak this has changed: nodes are not ordered unless their parent has a node type which supports ordering.

To illustrate the difference between sling:Folder (no ordering required) and sling:OrderedFolder (ordering required), I did a small test. I wrote a small benchmark which creates 5000 nodes, then adds more nodes, does random reads and deletes them afterwards. For every operation a single node is created or deleted, followed by a save(). (Sourcecode)
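
This is not the original benchmark source code, but a rough sketch of what the create case could look like (the parent node name and child node names are made up); the results follow below:

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class ChildNodeBenchmark {

  // create "count" child nodes below a parent of the given folder type
  // (e.g. "sling:Folder" or "sling:OrderedFolder"), saving after every single node,
  // and return the elapsed time in milliseconds
  public long createNodes(Session session, String folderType, int count) throws RepositoryException {
    Node parent = session.getRootNode().addNode("benchmark-" + folderType.replace(':', '-'), folderType);
    session.save();

    long start = System.currentTimeMillis();
    for (int i = 0; i < count; i++) {
      parent.addNode("node-" + i, "nt:unstructured");
      session.save(); // one save per created node, as described above
    }
    return System.currentTimeMillis() - start;
  }
}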

Operation               sling:Folder    sling:OrderedFolder
Create 5000 nodes       6124 ms         17129 ms
Random read 500 nodes   2 ms            9 ms
Add 500 nodes           112 ms          564 ms

This small benchmark (executed on a 2014 MacBook Pro with SSD, AEM 6.0, TarMK, Oak 1.0.0) shows:

  • Adding lots of child nodes to a node is much faster when you use a non-ordering nodetype
  • Random reads are also faster; obviously Oak can use more efficient data structures than a list if it doesn't need to maintain the ordering.

The factor of 3-4 is obviously quite significant. Of course the benefit is smaller if you have fewer child nodes.