A recurring problem I see in AEM project implementations is the problem of missing abstraction. A lot of code deals passes around resources, ValueMaps and even Strings (paths). And because we are supposed to build software the proper way, the called method checks (or more often: not checks) that the provided resource parameter is not null, and that the resource is of the correct type.
But instead of dealing with resources, the class names and comments suggest that the code actually dealing with products. Or website structures. Or assets. But instead of using a “product” classes (or website class, or the provided asset class) still resources are used. The abstraction is missing!
For me the root cause of this problem is the CRXDE Lite. Exactly that thing which you can open on your local AEM instance at /crx/de/. Because it shows you a very nice hierarchical view to the repository, it shows you paths, and properties. And if a developer starts to build a mental model of something, this tool comes in quite handy. Because you can reach everything via path, which is a String! So instead of expressing relations between concepts I see often this:
is much easier to use (and did you spot the off-by-one bug in the String operation example? And what does happen if the path ends already with a slash?). But that still leaves the question, why you need to get the parent resource. Maybe a
is a more expressive way to describe the same. I would definitely prefer it.
So CRXDE is your biggest enemy when designing your application. If you are a seasoned AEM developer, my recommendation to you: Don’t explain your application with CRXDE. Rather use proper abstractions. Don’t do CRXDE driven development!
If that topic sounds familiar to you: I did a talk on the AdaptTo() conference 2020 regarding this topic, you can find the recording here. There I explain the problem in more detail, also including some better examples 🙂
In the last blog post I briefly talked about the basics what to consider when you are writing cluster-aware code. The essence is to be aware of your write activities, and make sure that the scheduled activities are running only on a single cluster node and not on many or all of them.
Today’s focus is on the behavior of JCR sessions with respect to clustering. From a conceptual point of view there is hardly a difference to a single-node cluster (or standalone instance), but the presence of more cluster nodes add a new angle of potential problems to it.
When I talk about JCR, I am thinking of the Apache Oak implementation, which is implemented on top of the MVCC pattern. (The previous Jackrabbit implementation is using a different approach, so this whole blog post does not apply there.) The basic principle of MVCC is that each session is clearly separated from any other session which is open in parallel. Also any changes performed on a session is not visible to other sessions unless
the other session is invoking session.refresh() or
the other session is opened after the mentioned session is closed.
This behavior applies to all sessions of a JCR repository, no matter if the are opened on the same cluster node or not. The following diagram visualizes this
We have 2 sessions A1 and B1 which are initiated at the same time t0, and which perform changes independently of each other on the repository, so session B1 cannot see the changes performed with A1_1 (and vice versa). At time t1 session A1 is refreshed, and now it can see the changes B1_1 and B1_2. And afterwards B1 is refreshed as well, and can now see the changes A1_1 and A1_2 as well.
But if a session is not refreshed (or closed and a new session is used), it will never see the changes which happened on the repository after the session has been opened.
As said before, these sessions do not need to run on 2 separate cluster nodes, you get the same behavior on a single cluster node as well. But I mentioned, that multiple cluster nodes are a special problem here. Why is that case?
That problem are OSGI services in the background, which perform a certain job and write data to the JCR repository. In a single-node cluster this not a problem, because all of these activities go through that single service; and if that service uses a long-running JCR session for it, that will never be a problem. Because this service is responsible for all changes, and the service can read and write all the relevant data. In a cluster with more than 2 nodes, each cluster node might have that service running, and the invocations of the services might be random. And as in the diagram above, on cluster node A the data A1_1 is written. And on cluster node 2 the data point B1_1 is written. But they don’t see each other’s changes if they don’t refresh the session! And in most applications, which are written for single-node AEM instances, session.refresh() is barely used, because in such situations there’s simply no need for it, as this problem never occurred.
So when you are migrating your application to AEM as a Cloud Service, review your applications and make sure that you find all long-running ResourceResolvers and JCR sessions. The best option is then to remove these long-running sessions and replace them with short-living ones, which are closed if the job is done. The second-best option is to introduce a session.refresh(), so the session sees any updates which happend to the repository in the meanwhile. (And btw: if you registering an ObservationListener in that session, you don’t need a manual refresh, as this refresh is done by the ObservationListener method anyway; what would it be for if not for reporting changes to the repository, which happen after opening the session?)
That’s all right now regarding cluster-aware coding. But I am sure that there is more to come 🙂
With AEM as a Cloud Service quite a number of small things have changed; and next to others you also get real clustering support in the authoring environment. Which is nice, because it gives you downtime-less authoring during deployments.
But this cluster also comes with a few gotchas, and one of them is that your application code needs to be cluster-aware. But what does that mean? What consequences does it have and what code do you have to change if you have never paid attention to this aspect?
The most important aspect is to do “every change only once“. It doesn’t make sense that 2 cluster nodes are importing the same set of data. A special version of this aspect is “avoid concurrent writes to the same node“, which can happen when a scheduled job is kicked off at the same time on all nodes, and this job is trying to change something in the repository. In this case you don’t only have overhead, but very likely a lot of exceptions.
And there is a similar aspect, which you should pay attention to: connections to external systems. If you have a cluster, running the same code and configs, it’s not always wanted that each cluster node reaches out to that external system. Maybe you need to the update it with the latest content only once, because it triggers some expensive processing on their side, and you don’t want to have that triggered two or three times, probably pretty much at the same time.
I have mentioned you 2 cases where a clustered application can be behave differently than a single-node environment, now let me show you how you can make your application cluster-aware.
Scheduled jobs are a classic tool to execute certain jobs at a certain time. Of course we could use the Sling Scheduler directly, but to make the execution more robust, you should wrap it into a Scheduled Sling Job.
Jobs give you the guarantee, that this job is going to be executed only at least once.
Use the Sling Scheduler only for very frequent jobs (e.g. once every 5 minutes), where it doesn’t matter if one execution is skipped, e.g. because the instance was just restarting. To limit the execution of such a job to a single node, you can annotate the job runner with this annotation:
In-memory caches are often used to speed up operations. Most often they contain the results of previous operations which are then reused; cache elements are either actively purged or expire using a time-to-live.
Normally such caches are not affected by clustering. They might contain different items with potentially different values in the cluster nodes, but that must never be a problem. If that is a problem, you have to look for a different approach, e.g. persisting the data to the repository (if they are not already coming from there) or externalizing the cache (e.g to a redis or memcached instance).
Also, having a simpler application instead of the highest-cache-hit ration possible is often a good trade-off.
Ok, these were the topics I wanted to discuss here. But expect a blog post about one of my favorite topics: “Long running sessions and clustering”.
In the last parts of this small series (part 1, part 2, part 3) I covered some basic approaches how you can use the Sling and AEM mocking libraries to ease writing unittests. The examples were quite basic and focussed, but in reality many test cases turn out to be much more complex.
And especially when your code has dependencies to other OSGI services, tests can get tricky. So today I want to walk you through some unittest I wrote some time ago, it’s a unittest for the EnsureOakIndex functionality (EnsureOakIndexJobHandlerTest).
The interesting part is that the required EnsureOakIndex service references 4 other services in total; if they are not present, my EnsureOakIndex service will never start properly. Thus you have to fullfill all service requirements of an OSGI service in the unittest as well (at least if you want to use SlingContext like I do here).
The easiest way to solve this is to rely on predefined services which are part of the SlingMocks or AemMocks. The second best way is to create simple mocks and register them a service, so the dependency is fulfilled. That’s definitely a convenient way if your tests do not invoke any of the service methods at all.
Thus the setup() method of my unittests are often pretty large, because there I prepare and inject all other services which I need to make my software-under-test work.
And because this setup works quite well and reliably, I always use AemContext for my unittests (or SlingContext, but as I haven not yet observed any difference in test execution time, I often prefer just AemContext because it comes with some more sevices). Just if I don’t need resources, nodes and no OSGI, I stick with plain junit. For everything else AemContext removes the necessity for mocking a lot.
I introduced SlingMocks in the recent blog posts (part 1, part 2), and till now we only covered some basics. Let’s dig deeper now and use one of its coolest features: Mocking resources.
Well, to be honest, we don’t mock resources. We do something much better: We build an in-memory structure which represent sling resources. We can even map that to an in-memory JCR-repository!
For this blog post I created a very simple Sling model, which supposed to build a classic navigation. For simplicity I omitted nearly all of the logic which makes up a production-ready navigation component, and want to add just the pages below the site root to it.
The interesting part is the unit test. This time I used the AemMocks library from wcm.io, because it provides all the magic of the SlingMocks, plus some AEM specific objects I would like to use. As the AemContext class inherits directly from the SlingContext class, we can just use it as a drop-in replacement.
The setup method is very simple: It loads a JSON structure and populates a in-memory resource tree with it (adds everything below a resource “/content”). Using this approach there’s no longer the need to mock resources, properties/value maps, and mock the relation between these. This is all done by the SlingMocks/AemMocks framework.
Thus a testcase looks like this:
This simple example shows how easy it can be to write unit tests. The test code is very concise and easy to read and understand. In my opinion the biggest issue when writing unit tests now is creating the test content 🙂
Some notes to the test content:
My personal style is to put the test content into the same structure than the java packages. This makes it very easy to find the testcontent and keeps the testcontent and the unittest together.
You can write the JSON manually. But you can also use CRXDE Lite to create some structure in your local AEM instance and then export it with “http://localhost:4502/content/myproject/testcontent.tidy.-1.json ” and paste it into the JSON file. I find this way much more convenient, especially if you already have most of the test content ready. But if you do that, please clean up the unnecessary properties in the test data. They are polluting the test content and make it harder to understand.
In the previous post I showed you how easy it is to start using SlingContext. If you start to use this approach in your own project and just copy the code, your Maven build is likely to fail with messages like this:
[ERROR] testActivate_unconfiguredParams(de.joerghoh.cqdump.samples.unittest.ReplicationServletTest) Time elapsed: 0.01 s <<< ERROR! org.apache.sling.testing.mock.osgi.NoScrMetadataException: No OSGi SCR metadata found for class de.joerghoh.cqdump.samples.unittest.ReplicationServlet
The problem here is not the use of the SlingMock library itself, but rather the fact that we use SlingMocks to test code which uses OSGI annotations. The fix itself is quite straight-forward: We need to create OSGI metadata also for the unittests and not only for the bundling (SlingMock is reading these metadata for the test execution).
That’s the reason why there’s the need to have a dedicated execution definition for the maven-bundle-plugin:
Also you need to instruct the maven-bundle-plugin to actually export the generated metadata to the filesystem using the “<exportScr> statements; this should be the default in my opinion, but you need to specify it explicitly. Also don’t forget to add the “_dsannotation” and “_metatypeannotations” statements to its instruction section:
And even then it will fail, if you don’t upgrade the maven-bundle-plugin to a version later than 4.0:
Ok, if you adapted your POMs in this way, your SlingMock based unittests for OSGI r6-based services should run fine. Now you can start exploring more features of SlingMocks.
But with SlingMocks we also have much better unittest tooling (thanks a lot to Stefan Seifert and the Sling people), so we much more and better/easier tooling than just Mockito and Powermock. SlingContext (and its extension AemContext) allows you create unit tests quite easily. Using them can help you get rid of mocking Sling Resources, repo access and many things more.
On top of that, SlingMock can easily work with the new OSGI r6 annotations, which allow you to define OSGI properties in Pojos. Mocking the @ObjectClassDefinition classes isn’t that easy, because they are essentially annotations …
To illustrate that, I have created a minimal demo with a servlet and testcases for it (source code at Github). Basically the functionality of the class itself is not relevant for this article, but we want to focus on aspects how you utilize the frameworks best to avoid boilerplate code.
The code is quite simple, but to make unittesting a bit more challenging, it uses the new OSGI r6 annotations for OSGI configuration (using the @Designate and the @ObjectClassDefinition annotations) plus a referenced service.
If you try to mock the ReplicationServlet.Config class the naive way, you will find out, that it’s an annotation, which is referenced in the activate() method. I always failed to mock it somehow, so I switched gear and started to use SlingMock for it. (I don’t want to say that it is not possible, but it’s definitly not straight-foward, and in my opinion writing unit-tests should be straight forward, otherwise they are not created at all.)
With SlingMocks the approach changes. I am not required to create mocks, but SlingMocks provides a mocked OSGI runtime we can use. That means, that we create the OSGI parameters as a map to tell the SlingContext object to register our service with these parameters (line 59).
Because SlingContext implements quite a bit of the OSGI semantics, it also requires that all referenced services are available (if these are static references). Therefor I use Mockito to mock the Replicator and I register the mock to provide the Replicator service. In realworld I could verify the interactions of my servlet with that mock.
This basic example illustrates how you can use SlingMocks to avoid a lot of mocking and stubbing. This example does not utilize the full power of SlingMocks yet, we are just scratching at the surface. But we already have some benefit : If you switch from SCR annotations to OSGI annotations, your SlingMock unittests don’t need any change, because it provides an OSGI-like environment, and there the way how metatypes are generated and injected are abstracted away.
Since the beginning of AEM development we used annotations to declare OSGI services; @Component, @Service, @Property and @Reference should be known to everyone how has ever developed backend stuff for AEM. The implementation behind these annotations came from the Apache Felix project, and they were called the SCR annotations (SCR = Service Component Runtime). But unlike the Service Component Runtime, which is part of the OSGI standard for quite some, these annotations were not standardized. This changed with OSGI Release 6.
With this release annotations were also standardized, but they are 100% compatible to the SCR annotations. And there are a lot of resources out there, which can help to explain the differences:
(And there are definitly much more resources out there).
I recently worked on migrating a lot of the code from ACS AEM Commons from SCR annotations to OSGI annotations, and I want to share some learning I gained on the way. Because in some subtle areas the conversion isn’t that easy.
Mixed use of SCR annotations and OSGI annotations
You can mix SCR annotations and OSGI annotations in a project, you don’t need to migrate them all at once. But you can to be consistent on a class level, you cannot mix SCR and OSGI annotations in a single class. This is achieved by an extension to the maven-bundle-plugin (see below).
SCR property annotations give you a lot of freedom. You can annotate them on top of the class (using the @Properties annotation as container with nested @Property annotations), you can annotate individual constant values to be properties. You can make them visible in the OSGI webconsole (technically you are creating a metatype for them), or you can mark them as private (no metatype is created).
With OSGI annotations this is different.
Metatype properties are handled in the dedicated configuration class marked with @ObjectClassDefinition. They cannot be private.
Properties which are considered to be private are attached to the @Component annotation. They cannot be changed anymore.
A limitation from a backward compatibility point of view: With SCR annotations you are not limited in the naming of properties, next to characters often the “.” (dot) and the “-” (dash, minus) was used. With OSGI r6 annotations you can easily create a property with a “.” in it
String before_after() default "something";
will result in the property with the name “before.after”; but with OSGI r6 annotations you cannot create properties with a “-” in it. Only OSGI r7 (which is supported in AEM 6.4 onwards) supports it with a construct like this:
String before$_$after() default "something";
If you want to keep compatibility with AEM 6.3, expect the breakage of property names or you need to investigate in workarounds (see #1631 of ACS AEM Commons). But my recommendation is to avoid the use of the “-” in property names alltogether and harmonize this in your project.
Update: I posted an additional blog post specifically on migrating SCR properties, mostly in the context of OSGI DS and OSGI Metatypes.
Labels & description
All the metatype stuff (that means, how OSGI configurations appear in the /system/console/configMgr view) is handled on the level of the @ObjectClassDefinition annotation and the method annotated with it. With the SCR annotations this was all mixed up between the @Component annotation and the @Property fields.
Update the tooling to make it work
If you want to work with OSGI annotations, you should update some elements in your POM as well:
Update the maven-bundle-plugin to 4.1.0
Remove the dependency to the maven-scr-plugin
Add a dependency to org.osgi:org.osgi.annotations:6.0.0 to your POM.
The interesting part is here is the plugin to the maven-bundle-plugin, which can also handle SCR annotations; this statement allows you to mix both types of annotations.
This blog post should have given you some hints how you migrate the SCR annotations of an existing codebase to OSGI annotations. It’s definitly not a hard task, but some details can be tricky. Therefor it’s cool if you have the chance to mix both types of annotations, so you don’t need a big-bang migration for this.
A problem I encounter every now and then are leaking JCR sessions; that means that JCR sessions are opened, but never closed, but just abandoned. Like Files, JCR sessions need to be closed, otherwise their memory is not freed and they cannot be garbage collected by the JVM. Depending on the number of sessions you leave in that state this can lead to serious memory problems, ultimately leading to a crash of the JVM because of an OutOfMemory situation.
(And just to be on the safe side: In AEM ootb all ResourceResolvers use a JCR session internally; that means whatever I just said about JCR sessions applies the same way to Sling ResourceResolvers.)
So it’s very easy to find out if you don’t have a memory leak because of leaking JCR sessions: The number of log statements for registration of these mbeans must match the number of log statements for unregistration.
In many cases you probably don’t have exact matches. But that’s not a big problem if you consider:
On AEM startup a lot of sessions are opened and JCR observation listeners are registered to them. That means that a logfile with AEM starts and stops (and the number of starts do not match the number of stops) it’s very likely that these numbers do not match. Not a problem.
The registration (and also the unregistration) of these mbeans often happens in batches; if this happen during logfile rotation, you might have an imbalance, too. Again, not per se a problem.
It’s getting a problem, if the number of sessions opened is always bigger than the number of sessions closed over the course of a few days.
Here I just have the log data of a single day, and it’s very obvious, that there is a problem, as around 220k sessions are opened but never closed. On a single day!
To estimate the effect of this, we need to consider that for every of these log statements these objects are retained:
A JCR session (plus objects it reaches, and depending on the activities happening in this session it might also include any pending change, which will never going to be persisted)
A Mbean (referencing this session)
So if we assume that 1kb of memory is associated with every leaking session (and that’s probably an very optimistic assumption), this would mean that the system above would loose around 220M of heap memory every day. This system probably requires a restart every few days.
How can we find out what is causing this memory leak? Here it helps, that Oak stores the stack trace when opening sesions as part of the session object. Since around Oak 1.4 it’s only done if the number of open sessions exceeds 1000; you can tune this value with the system property “oak.sessionStats.initStackTraceThreshold”; set it to the appropriate value. This is a great help to find out where the session is opened.
And then go to /system/console/jmx, check for the “SessionStatistics” mbeans (typically quite at the bottom of the list) and select on the most recent ones (they have the openening date already in the name)
And then you can find in the “initStackTrace” the trace where this session has been opened:
With the information at hand where the session has been opened it should be obvious for you to find the right spot where to close the session.
If you spot a place where a session is opened in AEM product code but never closed, please check that with Adobe support. But be aware, that during system startup sessions are opened and will stay open while the system is running. That’s not a problem at all, and please do not report them!
It’s only a problem if you have a at least a few hundreds session open with the very same stack trace, that’s a good indication of such a “leaking session” problem.
A good followup reading on AEM HelpX pages with some details how you can fix it.
OSGI as component model is one of the cores of AEM, as it allows to dynamically register and consume services offered by other parts of the system. It’s the central registry you can ask for all kind of services.
If you have some weeks of CQ experience as developer, you probably already know the mechanics how to access a product service. For example one of the most often used statements in a service (or component) is:
which gives you access to the SlingRepository service, through which you can reach the JCR. Technically spoken, you build a static reference. So your service gets active only when this reference can be resolved. By this you can rely on the repository being available whenever your service is running. This is a constraint which is not a problem in many cases. Because it wouldn’t make sense for your service to run without the repository, and it also frees to permanently checking “repo” for being not null 🙂
Sometimes you don’t want to wait for a reference to be resolved (maybe breaking a dependency loop) or you can just deliver additional value if a certain (optional) service is available. In such cases you can make it a dynamic reference
Now there’s no hard dependency to the SlingRepository service; so your service might get active before the SlingRepository service is available, and therefor you need to handle the case that “repo” is null.
Per se this feature might have little importance to you, but combining it with other aspects makes it really powerful. More on that in the next post…