Sling Scheduled Jobs vs Sling Scheduler

Apache Sling and AEM provide 2 different approaches to start processes at a given time or in a given interval. It is not always trivial to make the right decision between these two, and I have already seen a few cases of misuse. Let's dive into this topic; I will outline in which situations to use the Scheduler and when to use Scheduled Jobs.


How to analyze “Authentication support missing”

Errors and problems in running software often manifest in interesting and non-obvious ways: a problem in location A shows up only as an unrelated error message in a different location B.

We have one example of such a situation in AEM, and that's the famous "Authentication support missing" error message. I often see the question "I got this error message; what should I do now?", so I decided it's time to write a blog post about it. Here you are.

"Authentication support missing" is actually not even correct: there is no authentication module available, so you cannot authenticate. But in 99.99% of the cases this is just a symptom, because the default AEM authentication depends on a running SlingRepository service, and a running Sling repository has a number of dependencies itself.

I want to highlight 2 of these dependencies, because they tend to cause problems most often: the Oak repository and the RepositoryInitializer service. Both must be up and have started/run successfully before the SlingRepository service can be registered successfully. Let's look into each of these dependencies.

The Oak repository

The Oak repository is a quite complex system in itself, and there are many reasons why it might not start. To name a few:

  • Consistency problems with the repository files on disk (for whatever reasons), permission problems on the filesystem, full disks, …
  • Connectivity issues towards the storage (especially if you use a database or MongoDB as storage)
  • Messed up configuration

If you get an "authentication support missing" message, your first check should be on the Oak repository, typically in the AEM error.log. If you find ERROR messages logged by any "org.apache.jackrabbit.oak" class during the startup, this is most likely the culprit. Investigate from there.

Sling Repository Initializer (a.k.a. “repoinit”)

Repoinit is designed to ensure that a certain structure in the repository is provided, even before any consumer accesses it. All of the available scripts must be executed, and any failure will immediately terminate the startup of the SlingRepository service. Check also my latest blog post on the Sling Repository Initializer for details on how to prevent such problems.

Repoinit failures are typically quite prominent in the AEM error.log, just search for an ERROR message starting with this:

*ERROR* [Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted …

These are the 2 biggest contributors to this "Authentication support missing" error message. Of course there are more reasons why it could appear, but to be honest, I have only seen these 2 cases in recent years.

I hope that this article helps you to investigate such situations more swiftly.

How to deal with RepoInit failures in Cloud Service

Some years ago, even before AEM as a Cloud Service, the RepoInit language was implemented as part of Sling (and AEM) to create repository structures directly at startup of the JCR repository. With it your application can rely on some well-defined structures always being available.

In this blog post I want to walk you through a way how you can test repoinit statements locally and avoid pipeline failures because of it.

Repoinit statements are deployed as part of OSGi configurations, which means that during the development phase you can work with them in an almost interactive way. Exceptions are not a problem either; you can fix the statement and retry.
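As a minimal sketch of such a deployment (the configuration name and the statements are just examples): repoinit scripts are typically shipped as an OSGi factory configuration for the Sling RepositoryInitializer, for instance in a config package:

org.apache.sling.jcr.repoinit.RepositoryInitializer~mysite.cfg.json (name is an example):
{
  "scripts": [
    "create path (sling:Folder) /conf/mysite",
    "set ACL for mysite-authors\n    allow jcr:read on /conf/mysite\nend"
  ]
}

On a local instance you can adjust and redeploy this configuration to iterate on the statements.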

The situation is quite different when you already have repoinit statements deployed and you start up your AEM (to be exact: the Sling repository service) again. In this case all repoinit statements are executed as part of the startup of the repository, and any exception during their execution will stop the startup of the repository service and render your AEM unusable. In the case of Cloud Manager and AEM as a Cloud Service this will break your deployment.

Let me walk you through 2 examples of such an exception and how you can deal with them.

*ERROR* [Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted java.lang.RuntimeException: Session.save failed: javax.jcr.nodetype.ConstraintViolationException: OakConstraint0025: /conf/site/configuration/favicon.ico[[nt:file]]: Mandatory child node jcr:content not found in a new node 
at org.apache.sling.jcr.repoinit.impl.AclVisitor.visitCreatePath(AclVisitor.java:167) [org.apache.sling.jcr.repoinit:1.1.36] 
at org.apache.sling.repoinit.parser.operations.CreatePath.accept(CreatePath.java:71)

In this case the exception is quite detailed about what actually went wrong: it failed when saving, and it says that /conf/site/configuration/favicon.ico (of type nt:file) was affected. The problem is that the mandatory child node "jcr:content" is missing.

Why is it a problem? Because every node of nodetype “nt:file” requires a “jcr:content” child node which actually holds the binary.
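As an illustration (my reconstruction of what likely happened, not necessarily the original statements): repoinit cannot create an nt:file node, because it cannot supply the mandatory jcr:content child; create folder-type nodes instead and ship binaries like a favicon in a content package:

# fails: repoinit cannot create the mandatory jcr:content child of an nt:file node
create path /conf/site/configuration/favicon.ico(nt:file)

# works: create only folder-type structures via repoinit
create path (sling:Folder) /conf/site/configuration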

This is a case which you can detect very easily also on a local environment.

Which leads to the first recommendation:

When you develop on your local environment, you should apply all repoinit statements to a fresh instance without any manual changes. Otherwise your repoinit statements might rely on the presence of structures which are not provided by the repoinit scripts themselves.

Having a mix of manual changes and repoinit on a local development environment and then moving it over untested often leads to failures in the Cloud Manager pipelines.

The second example is a very prominent one, and I see it very often:

[Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted java.lang.RuntimeException: Failed to set ACL (java.lang.UnsupportedOperationException: This builder is read-only.) AclLine DENY {paths=[/libs/cq/core/content/tools], privileges=[jcr:read]} 
at org.apache.sling.jcr.repoinit.impl.AclVisitor.setAcl(AclVisitor.java:85)

It's the well-known "This builder is read-only" variant. To understand the problem and its resolution, I need to explain a bit how the build process assembles AEM images in the Cloud Manager pipeline.

In AEM as a Cloud Service you have an immutable part of the repository, which consists of the trees "/libs" and "/apps". They are immutable because they cannot be modified at runtime, not even with admin permissions.

This immutable part of the image is assembled during build time. The process merges the product-side parts (/libs) and the custom application parts (/apps) together. After that all repoinit scripts run, both the ones provided by the product as well as any custom ones. During that part of the build these trees are writable, so writing into /apps using repoinit is not a problem.

So why do you actually get this exception, if /libs and /apps are writeable at that point? Because repoinit is executed a second time: during the "final" startup, when /apps and /libs are immutable.

Repoinit is designed around the idea that all activities are idempotent. This means that if you want to create an ACL on /apps/myapp/foo/bar, the repoinit statement is a no-op if that specific ACL already exists. A second run of repoinit will do nothing, but find everything still in place.

But if the system has to execute this action again in the second run, it's not a no-op anymore. It means that the ACL is not there as expected (or whatever else the goal of that repoinit statement was).

And there is only one reason why this happens: some other action between these 2 executions of repoinit changed the repository. The only other thing which modifies the repository is the installation of content packages.

Let’s illustrate this problem with an example. Imagine you have this repoinit script:

create path /apps/myapp/foo/bar
set acl on /apps/myapp/foo/bar
  allow jcr:read for mygroup
end

And you have a content package which comes with content for /apps/myapp, whose filter is set to "overwrite", but which does not contain this ACL.

In this case the operations leading to this error are these:

  • Repoinit sets the ACL on /apps/myapp/foo/bar
  • The deployment overwrites /apps/myapp with the content package, so the ACL is wiped
  • AEM starts up
  • Repoinit wants to set the ACL on /apps/myapp/foo/bar, which is now immutable. It fails and breaks your deployment.

The solution to this problem is simple: you need to adjust the repoinit statements and the package definitions (especially the filter definitions) in a way that the package installation does not wipe and/or overwrite any structure created by repoinit. And by "structure" I do not refer only to nodes, but also to nodetypes, properties etc. All must be identical, and in the best case they don't interfere at all.
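One way to achieve this on the package side (a sketch, under the assumption that FileVault's merge import mode fits your content; depending on the package you may also need to review its ACL handling setting): mark the overlapping filter root with mode="merge", so existing nodes such as the rep:policy node created by repoinit are left untouched:

<!-- META-INF/vault/filter.xml of the content package -->
<workspaceFilter version="1.0">
    <!-- mode="merge" only adds new nodes and does not touch or remove existing ones -->
    <filter root="/apps/myapp" mode="merge"/>
</workspaceFilter>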

It is hard to validate this locally, as you don’t have an immutable /apps and /libs, but there is a test approach which comes very close to it:

  • Run all your repoinit statements in your local test environment
  • Install all your content packages
  • Enable write tracing (see my blog post)
  • Re-run all your repoinit statements.
  • Disable write tracing again

During the second run of the repoinit statements you should not see any write in the trace log. If you see any write operation, it's a sign that your packages overwrite structures created by repoinit. You should fix these asap, because they will later break your Cloud Manager pipeline.
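For reference, write tracing is essentially a TRACE-level logger on Oak's write operations (check the referenced blog post for the exact steps; the configuration name below is made up and the logger category is my assumption):

org.apache.sling.commons.log.LogManager.factory.config~writetracing.cfg.json (name is an assumption):
{
  "org.apache.sling.commons.log.level": "trace",
  "org.apache.sling.commons.log.file": "logs/write-tracing.log",
  "org.apache.sling.commons.log.names": [ "org.apache.jackrabbit.oak.jcr.operations.writes" ]
}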

With this information at hand you should be able to troubleshoot any repoinit problems already on your local test environment, avoiding pipeline failures because of it.

AEM micro-optimization (part 4) – define allowed templates

This time I want to discuss a different type of micro-optimization. It's not something you as a developer can implement in your code, but rather a question of application design, which has some surprising impact. I came across it when I recently investigated poor performance of the Siteadmin navigation. And although I did this investigation on AEM as a Cloud Service, the logic on AEM 6.5 behaves the same way.

When you click through your pages in the Siteadmin navigation, AEM collects a lot of information about pages and folders to display them in the proper context. For example, when you click on a page with child pages, it collects information about which actions should be displayed if a specific child page is selected (copy, paste, publish, …).

One important piece of information is whether the "Create page" action should be made available. And that's the thing I want to outline in this article.

Screenshot: “Create” dialog

Assuming that you have the required write permissions on that folder, the most important check is whether any template is allowed to be used for a child page of the current page. The logic is described in the documentation and is quite complex.

In short:

  • On the content the template must be allowed (via the cq:allowedTemplates property, if present), AND
  • The template must be allowed to be used as a child page of the current page

Both conditions must be met for a template to be eligible as the source for a new page. To display the entry "Page" it's sufficient if at least 1 template is allowed.

Now let's think about the runtime performance of this check, which is mostly determined by the total number of templates in the system. AEM determines all templates with this JCR query:

//jcr:content/element(*,cq:Template)

And that query returns 92 results on my local SDK instance with WKND installed. If we look a bit more closely at the results, we can distinguish 3 different types of templates:

  • Static templates
  • Editable templates
  • Content Fragment models

So depending on your use case it's easy to end up with hundreds of templates, and not all of them are applicable at the location you are currently in. In fact, typically only very few templates can be used to create a page there. That means the check most likely needs to iterate a lot before it eventually encounters a template which is a match.

Let's come back to the evaluation of whether that entry should be displayed. If you have defined the cq:allowedTemplates property on the page or its ancestors, it's sufficient to check the templates listed there. Typically it's just a handful of templates, and it's very likely that you get a "hit" early on, which immediately terminates this check with a positive result. I want to explicitly mention that not every template listed there can actually be created here, because there are also other constraints (e.g. the parent template must be of a certain type, etc.) which must match.

If template A is allowed to be used below /content/wknd/en, then we just need to check this single template A to get that hit. We don't care where it is in the list of templates returned by the above query, because we know exactly which one(s) to look at.

If that property is not present, AEM needs to go through all templates and check the conditions for each and every one, until it finds a positive result. The list of templates is in the order in which they are returned from the JCR query, which means the order is not deterministic. It is also not possible to order the result in a helpful way, because the semantics of our check (which includes regular expressions) cannot be expressed as part of the JCR query.

So you are very lucky if the JCR query returns a matching template already at position 1 of the list, but that’s very unlikely. Typically you need to iterate tens of templates to get a hit.

So, what's the impact of this iteration and the checks on performance? In a synthetic test with 200 templates, when there was no match at all, it took around 3-5 ms to iterate over and check all of the results.

You might say "I really don't feel a 3-5 ms delay", but when the list view in Siteadmin performs this check for up to 40 pages in a single request, it becomes a 120-200 millisecond difference. And that is a significant delay for requests where bad performance is immediately visible. Especially if there's a simple way to mitigate it.

And for that reason I recommend providing "cq:allowedTemplates" properties in your content structure. In many cases it's possible, and it will speed up the Siteadmin navigation.
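For illustration, such a property can be set on the page content like this (a sketch; the template paths are placeholders for your own templates, and the property also accepts regular expressions):

<!-- .content.xml of e.g. /content/mysite/en -->
<jcr:root xmlns:cq="http://www.day.com/jcr/cq/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0"
    jcr:primaryType="cq:Page">
    <jcr:content
        jcr:primaryType="cq:PageContent"
        cq:allowedTemplates="[/conf/mysite/settings/wcm/templates/content-page,/conf/mysite/settings/wcm/templates/landing-page]"/>
</jcr:root>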

And for those who cannot change that: I am currently working on changing the logic to speed up the processing for the cases where no cq:allowedTemplates property is applicable. And if you are on AEM as a Cloud Service, you'll get this improvement automatically.

AEM micro-optimization (part 1)

As a follow-up to the previous article I want to show you what a micro-optimization can look like. My colleague Miroslav Smiljanic found that there is a significant difference in the time it takes to execute statements (1) and (2).

Node node = …
Session session = node.getSession();
String parentPath = node.getParent().getPath();

Node p1 = node.getParent(); // (1)
Node p2 = session.getNode(parentPath); // (2)

assertTrue(p1.isSame(p2)); // Node.equals() is not defined by JCR; isSame() is the correct comparison

He did the whole writeup in the context of a suggested improvement in Sling, and proved it with impressive numbers.

Is this change important? Just by itself it is not, because going up the resource/node tree is not that common compared to going down. So replacing a single call might only yield an improvement of a fraction of a millisecond, even if case (2) is up to 200 times faster than case (1)!

But if we replace node.getParent() with the more performant session.getNode(parentPath) call wherever possible, especially in the low-level areas of AEM and Sling, all areas might benefit from it. And then we don't execute it only once per page rendering, but maybe a hundred times. And then we might already end up with tens of milliseconds of improvement, for any request!

And in special use cases the effect might be even higher (for example if your code is constantly traversing the tree upwards).

Another example of such a micro-optimization, which is normally quite insignificant but can yield huge benefits in special cases, can be found in SLING-10269, where I found that built-in caching of the isResourceType() results reduces the rendering time of some special requests by 50%, because the check is performed thousands of times there.
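To illustrate the idea (a rough sketch only, not the actual implementation from SLING-10269): if the same type check is performed over and over, a small cache in front of the expensive resolution makes repeated calls cheap:

// Sketch: a naive cache around ResourceResolver.isResourceType();
// the real change in SLING-10269 lives inside the resolver implementation.
private final Map<String, Boolean> resourceTypeCache = new HashMap<>();

boolean isResourceTypeCached(ResourceResolver resolver, Resource resource, String resourceType) {
    // key on the resource's type and super type plus the type we check against (a simplification)
    String key = resource.getResourceType() + "|" + resource.getResourceSuperType() + "|" + resourceType;
    return resourceTypeCache.computeIfAbsent(key, k -> resolver.isResourceType(resource, resourceType));
}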

Typically micro-optimizations have these properties:

  • In the general case the improvement is barely visible (< 1% improvement of performance)
  • In edge cases they can be a life saver, because they reduce execution time by a much larger percentage.

These improvements accumulate over time, and that's where it gets interesting. When you have implemented 10 of these in low-level routines, the chances are high that your use case benefits from them as well. Maybe by 10 times 0.5% performance improvement, but maybe also by 20%, because you hit the sweet spot of one of them.

So it is definitely worth paying attention to these improvements.

My recommendation for you: read the entry on the Oak "Do's and Don'ts" page and try to apply this learning to your codebase. And if you find more such cases in the Sling codebase, the community appreciates a ticket.

(Photo by KAL VISUALS on Unsplash)

The effect of micro-optimizations

Optimizing software for speed is a delicate topic. Often you hear the saying "Make it work, make it right, make it fast", implying that performance optimization should be the last step you take when you code. Which is true to a very large extent.

But in many cases you are happy if your budget allows you to get to the "make it right" phase, and you rarely get the chance to kick off a decent performance optimization phase. That situation is common in many areas of the software industry, and performance optimization is often only done when absolutely necessary. Which is unfortunate, because it leaves us with a lot of software that has performance problems. And in many cases a large part of the problem could be avoided if only a few optimizations were done (at the right spot, of course).

But all this talk of a "performance improvement phase" assumes that it requires a huge effort to make software more performant. Which in general is true, but there are typically a number of actions which can be implemented quite easily and which can be beneficial. Of course these rarely boost your overall application performance by 50%; most often they just speed up certain operations. But depending on how frequently these operations are called, they can sum up to a substantial improvement.

I once did a performance tuning session on an AEM publish instance to improve the raw page rendering performance of an application. The goal was to squeeze more page responses out of the given hardware. Using a performance test and a profiler I found that the creation of JCR sessions and Sling ResourceResolvers took 1-2 milliseconds, which was worth investigating. Armed with this knowledge I combed through the codebase, reviewed all cases where a new session is created, and removed all cases where it was not necessary. This was really a micro-optimization, because I focused on tiny pieces of the code (not even the areas which are called many times), and the regular page rendering (on a developer machine) was not improving at all. But in production this optimization turned out to help a lot, because it allowed us to deliver 20% more pages per second out of the publish instances at peak.

In this case I spent quite some time to come to the conclusion that opening sessions can be expensive under load. But now I know that, and I spread that knowledge via code reviews and blog posts.

Most often you don't see the negative effect of these anti-patterns (unless you overdo it and every Sling Model opens a new ResourceResolver), and therefore the positive effects of applying these micro-optimizations are not immediately visible. But in the end, applying 10 micro-optimizations with a ~1% speedup each sums up to a pretty nice number.

And of course: If you can apply such a micro-optimization in a codepath which is heavily used, the effects can be even larger!

So my recommendation to you: If you come across such a piece of code, optimize it. Even if you cannot quantify and measure the immediate performance benefit of it, do it.

It's the same as with this loop:

for (int i = 0; i <= 100; i++) {
  othernumber += i;
}

I cannot quantify the improvement, but I know that

othernumber += 5050;

is faster than the loop, no questions asked. (Although that’s a bad example, because hopefully the compiler would do it for me.)

In the upcoming blog posts I want to show you a few cases of such micro-optimizations in AEM, which I personally used with good success. Stay tuned.

(Photo by Michael Longmire on Unsplash)

Writing integration tests for AEM, part 5

This is part of my ongoing series about writing integration tests with AEM.

Integration tests help you to keep control
Photo by Chris Leipelt on Unsplash

Writing tests seems to be a recurring topic 🙂 This week I wrote some integration tests which involved one of the most important workflows in AEM: the activation of pages. Up to now I haven't blogged about handling both author and publish in an integration test, so I will show you how to do it.

So let's assume that you want to do some product testing and validate that replication is working and also writes correct audit log entries. This should be covered with an integration test. You can find the complete source code in the ActivatePageIT in the integrationtests GitHub project.

Before we dig into the code itself, a small hint for the development phase of tests: if you want to execute only a single integration test, you can instruct Maven to do this with the parameter "-Dit.test=<Name of the test class>". So in our case the complete Maven command line looks like this:

mvn clean install -Peaas-local -Dit.test=ActivatePageIT -Dit.author.url=http://localhost:4502

(assuming that you run your AEM author on the same port as I do … if not, modify the parameters in the pom.xml).

On the coding side, the approach follows that of every integration test: we need to get the correct clients first.

As we want to use replication, we use a ReplicationClient, which is provided by the testing client library.
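The code snippets embedded in the original post are not reproduced here; as a rough sketch (the rule and class names follow the aem-testing-clients conventions and are my assumption, the actual ActivatePageIT may differ), obtaining the clients can look like this:

import com.adobe.cq.testing.client.ReplicationClient;
import com.adobe.cq.testing.junit.rules.CQAuthorPublishClassRule;
import com.adobe.cq.testing.junit.rules.CQRule;
import org.apache.sling.testing.clients.ClientException;
import org.junit.Before;
import org.junit.ClassRule;
import org.junit.Rule;

public class ActivatePageIT {

    // provides access to both the author and the publish instance under test
    @ClassRule
    public static final CQAuthorPublishClassRule cqBaseClassRule = new CQAuthorPublishClassRule();

    @Rule
    public CQRule cqRule = new CQRule(cqBaseClassRule.authorRule, cqBaseClassRule.publishRule);

    private ReplicationClient replicationClient;

    @Before
    public void setup() throws ClientException {
        // admin client against the author instance, adapted to the ReplicationClient API
        replicationClient = cqBaseClassRule.authorRule.getAdminClient(ReplicationClient.class);
    }
}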

Next we use a custom Page class, which allows us to define the parentPath.

Then the actual test case is straightforward.

I used some more features of the testing clients to just test the existence or absence of the page, plus the doGetJson() method to get the JSON representation of the pages (in the getAuditEntries() method).

So, writing integration tests with this tooling at hand is easy and actually fun. Especially if the test code is as straightforward to implement as here.

AEM as a Cloud Service and the handling of binaries

When you are a long-time user of AEM 6.x (and even CQ5), you are probably familiar with the Asset Update workflow. Its primary task is the extraction of metadata from the binary asset and the creation of (smaller) renditions of it. This workflow is normally executed on the AEM authoring instance.

“Never underestimate the bandwidth …!” (symbolic photo)
Photo by Massimo Botturi on Unsplash

But since its beginnings this approach has been plagued with problems:

  • The question of supported filetypes. Given the almost unlimited number of file formats and their often proprietary implementations, it's not always possible to perform these operations. In many cases, the support for these file types within Java is poor.
  • Additionally, depending on the size and type of the asset and the quality of the library which provides support for that filetype, the processing can be very time consuming and also consume a lot of heap. Imagine you want to create renditions of a TIFF file with dimensions of 10k x 10k pixels; assuming a 24-bit color depth, this requires roughly 300 megabytes of contiguous heap to store an uncompressed version of it. You have to size the heap accordingly, otherwise you will run out of memory (OOM).
  • To avoid these issues, external tools like ImageMagick were used for many filetypes. ImageMagick comes with support for various image types (in many cases much better than the Java image libraries), plus it cannot blow up the AEM process when processing fails (because it runs in a dedicated process). But the capabilities of ImageMagick are also limited, and the support for more exotic (non-image) file types could be better.
  • In all cases you need to size your hardware for a worst-case scenario. For example, you need to provision a lot of heap if your authors might start to ingest large images. And you need to provision enough CPU to mitigate negative impacts on all other operations.
  • Another big problem is latency. Assuming that your asset is very large (it's not uncommon to have assets larger than 1 gigabyte), it takes time to copy the binary from the (remote) datastore to a location where the processing takes place. Even if you can transfer 100 MiB per second, it takes 10 seconds to get the file onto the local disk; normally this transfer runs through the AEM JVM, which is problematic in terms of heap usage and can also cause performance problems. Not to mention code which is not aware of the possible sizes and tries to load the complete stream into memory.

In AEM as a Cloud Service this is offloaded, and that's what Asset Compute is for. It performs all these steps on its own, not using ImageMagick for image handling, but high-quality and optimized routines which also power other Adobe products.

But what does that mean for you as a developer for AEM as a Cloud Service? In the first place, it does not have any impact. But you should learn a few things from it:

  • Do not create any renditions on your own; use Asset Compute instead. This service is extensible (check out Project Firefly), so you can do all kinds of asset operations there. There is no need anymore to use Java image library code.
  • Avoid streaming binary data through AEM. AEM as a Cloud Service itself (the JVM) should not be bothered with streaming binary data into and out of the JVM. If you want to upload files into AEM, you should use the aem-upload library.

In general, think twice before you open an InputStream in AEM (either via Rendition.getStream() or via the JCR API). Normally you never know how much data is behind it, and for almost all transformation cases it makes sense to use Asset Compute to perform them.

Long running sessions and SegmentNotFoundExceptions

If you search this blog, you find one recurring theme over the years: the lifecycle of JCR sessions and Sling ResourceResolvers. That you should not keep them open for a long time, and that you definitely have to close them. But I never gave you an example of what can happen if you don't follow this recommendation. Until now.

These days I learned what an actual problem arising from this can look like. And the problem is called "SegmentNotFoundException".

In the past a SegmentNotFoundException was a clear indication of a corrupt JCR repository. The recommendation was always either to fix it or to restore from backup. Both operations are tedious, require downtime, and possibly also mean a loss of data. That's probably also the reason why this specific exception is often taken as a sign of such a repository corruption. So let's look at it systematically.

The root cause

With AEM 6.4 the feature of "tail compaction" was introduced, which is a variant of the online compaction feature. It is less efficient but takes less time than the full compaction. By default in AEM the tail compaction runs daily and the full compaction once a week.

But from what I understand, this tail compaction has a problem with long-running sessions, and it can happen that tar files which are still referenced get compacted and removed. That means that it's not really an on-disk corruption which needs to be fixed, but rather that some "old sessions" (read about MVCC in the previous post) are referencing data which is not there (anymore).

An unclosed session – a symbol photo (by engin akyurt on Unsplash)

Validate the symptoms

The problem I describe in this post happens under some special circumstances, which you should check before you start the hunt for long-running sessions:

  • You get SegmentNotFoundExceptions (always with the same segment ID).
  • A repository check doesn’t find any inconsistency.
  • If you restart the instance, the error is gone, but appears again after some time (mostly at least a day).
  • You are running AEM 6.4 or AEM 6.5 (SP doesn’t seem to matter).

In the case I observed, only a single workflow step was affected, but not all the time and only after some time, which made me believe that it was related to the compaction. But it was very hard to track down the error, because the workflow step itself was complex, but safe.

The solution

Fix any long-running session in your application (unless you are registering an ObservationListener in there, which takes care of the refreshes by design). Really all of them. Use the JMX web console plugin and check the list of registered session MBeans every day on a production instance. Count them. Look at the timestamps when the sessions were opened.

In the case I observed, the long-running session was in a different area of the application, but was working on the same data (user profiles) as the failing workflow step. The 2 areas in the code were totally unrelated to each other, so that was the only way to track it down.

Final words

Some other notes, which I consider as important in this context:

  • When you encounter a SegmentNotFoundException, please always open a support ticket, just in case. If it’s a different issue than described here, it’s better if you have that ticket open already.
  • If you see exactly this issue, and changing your application code makes the problem go away, please also raise a support ticket. That bug should get fixed (even though long-running sessions have not been recommended for years).
  • As mentioned, when you encounter this issue, it's not a persisted corruption. Restarting will make the issue disappear for some time, but that should only buy you time to identify and fix the long-running sessions.
  • And AEM as a Cloud Service is not affected by this problem, because neither Online Compaction nor Tail Compaction are used. Instead the Golden Master is offline compacted before cloning.

Long running sessions and clustering

In the last blog post I briefly talked about the basics of what to consider when you are writing cluster-aware code. The essence is to be aware of your write activities, and make sure that scheduled activities run only on a single cluster node and not on many or all of them.

Today's focus is on the behavior of JCR sessions with respect to clustering. From a conceptual point of view there is hardly a difference to a single-node cluster (or standalone instance), but the presence of more cluster nodes adds a new angle of potential problems to it.

When I talk about JCR, I am thinking of the Apache Oak implementation, which is built on top of the MVCC pattern. (The previous Jackrabbit implementation uses a different approach, so this whole blog post does not apply there.) The basic principle of MVCC is that each session is clearly separated from any other session which is open in parallel. Also, any changes performed in a session are not visible to other sessions unless

  • the other session is invoking session.refresh() or
  • the other session is opened after the mentioned session is closed.

This behavior applies to all sessions of a JCR repository, no matter if they are opened on the same cluster node or not. The following diagram visualizes this:

Diagram: 2 sessions performing changes to the repository without seeing the changes of the other, as long as they don't use session.refresh()

We have 2 sessions A1 and B1 which are initiated at the same time t0, and which perform changes independently of each other on the repository, so session B1 cannot see the changes performed with A1_1 (and vice versa). At time t1 session A1 is refreshed, and now it can see the changes B1_1 and B1_2. Afterwards B1 is refreshed as well, and it can then also see the changes A1_1 and A1_2.

But if a session is not refreshed (or closed and a new session is used), it will never see the changes which happened on the repository after the session has been opened.
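Expressed with the plain JCR API, the behavior looks roughly like this (a sketch; repository and credentials are assumed to come from the surrounding code):

Session sessionA = repository.login(credentials);
Session sessionB = repository.login(credentials);

// sessionA adds and persists a node
sessionA.getRootNode().addNode("change-from-A", "nt:unstructured");
sessionA.save();

// sessionB was opened before that save and has not been refreshed,
// so it still operates on its own (older) view of the repository
assertFalse(sessionB.nodeExists("/change-from-A"));

sessionB.refresh(true); // keepChanges = true
assertTrue(sessionB.nodeExists("/change-from-A"));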

As said before, these sessions do not need to run on 2 separate cluster nodes; you get the same behavior on a single cluster node as well. But I mentioned that multiple cluster nodes are a special problem here. Why is that the case?

The problem is OSGi services in the background, which perform a certain job and write data to the JCR repository. In a single-node cluster this is not a problem, because all of these activities go through that single service; and if that service uses a long-running JCR session, that will never be a problem, because the service is responsible for all changes and can read and write all the relevant data. In a cluster with more than one node, each cluster node might have that service running, and the invocations of the service might be random. As in the diagram above, on cluster node A the data A1_1 is written, and on cluster node B the data point B1_1 is written. But they don't see each other's changes if they don't refresh the session! And in most applications, which are written for single-node AEM instances, session.refresh() is barely used, because in such situations there's simply no need for it, as this problem never occurred.

So when you are migrating your application to AEM as a Cloud Service, review your application and make sure that you find all long-running ResourceResolvers and JCR sessions. The best option is to remove these long-running sessions and replace them with short-lived ones, which are closed when the job is done. The second-best option is to introduce a session.refresh(), so the session sees any updates which happened to the repository in the meantime. (And by the way: if you register an ObservationListener in that session, you don't need a manual refresh, as this refresh is done for the ObservationListener anyway; what would it be for, if not for reporting changes to the repository which happen after opening the session?)
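A sketch of the first option (the sub service name is a placeholder): obtain a short-lived ResourceResolver per unit of work and let try-with-resources close it, instead of keeping one around as a long-lived field:

Map<String, Object> authInfo = Collections.singletonMap(ResourceResolverFactory.SUBSERVICE, "my-service-user");
try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(authInfo)) {
    // a fresh resolver sees the current state of the repository
    Resource resource = resolver.getResource("/content/myapp");
    // ... do the actual work ...
} catch (LoginException e) {
    // could not obtain the service resource resolver
}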

That’s all right now regarding cluster-aware coding. But I am sure that there is more to come 🙂