The effect of micro-optimizations

Optimizing software for speed is a delicate topic. Often you hear the saying “Make it work, make it right, make it fast”, implying that performance optimization should be the last step of your coding work. Which is true to a very large extent.

But in many cases you are happy if your budget allows you to get to the “make it right” phase, and you rarely get the chance to kick off a decent performance optimization phase. That is true in many areas of the software industry, and performance optimization is often only done when absolutely necessary. Which is unfortunate, because it leaves us with a lot of software that has performance problems. And in many cases a large part of the problem could be avoided if only a few optimizations were done (at the right spot, of course).

But all this talk of a “performance improvement phase” assumes that it requires huge efforts to make software more performant. In general that is true, but there are typically a number of actions which can be implemented quite easily and still be beneficial. Of course these rarely boost your overall application performance by 50%; most often they just speed up certain operations. But depending on how frequently these operations are called, that can add up to a substantial improvement.

I once did a performance tuning session on an AEM publish instance to improve the raw page rendering performance of an application. The goal was to squeeze more page responses out of the given hardware. Using a performance test and a profiler I found that the creation of JCR sessions and Sling ResourceResolvers took 1-2 milliseconds, which was worth investigating. Armed with this knowledge I combed through the codebase, reviewed all cases where a new session was created and removed all cases where it was not necessary. This was really a micro-optimization, because I focussed on tiny pieces of the code (not even the areas which are called many times), and the regular page rendering (on a developer machine) did not improve at all. But in production this optimization turned out to help a lot, because it allowed us to deliver 20% more pages per second out of the publish at peak.

In this case I spent quite some time to come to the conclusion that opening sessions can be expensive under load. But now I know that, and I spread that knowledge via code reviews and blog posts.

Most often you don’t see the negative effect of these anti-patterns (unless you overdo it and every Sling Model opens a new ResourceResolver), and therefore the positive effects of applying these micro-optimizations are not immediately visible either. And in the end, applying 10 micro-optimizations with a ~1% speedup each sums up to a pretty nice number.
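To make this a bit more concrete, here is a minimal, made-up sketch of the kind of change I mean: a Sling Model which reuses the ResourceResolver that already backs the current request, instead of opening (and having to close) a dedicated one via the ResourceResolverFactory.

import javax.annotation.PostConstruct;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.SlingObject;

@Model(adaptables = SlingHttpServletRequest.class)
public class TeaserModel {

    // reuse the resolver of the current request; no new JCR session is created
    // and nothing needs to be closed by this model
    @SlingObject
    private ResourceResolver resolver;

    private Resource teaserTarget;

    @PostConstruct
    protected void init() {
        // made-up content path, just for illustration
        teaserTarget = resolver.getResource("/content/myproject/en/teaser");
    }

    public Resource getTeaserTarget() {
        return teaserTarget;
    }
}

The anti-pattern would be to inject a ResourceResolverFactory here and open a new service resolver on every model instantiation, even though the request already comes with a perfectly usable resolver.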

And of course: If you can apply such a micro-optimization in a codepath which is heavily used, the effects can be even larger!

So my recommendation to you: If you come across such a piece of code, optimize it. Even if you cannot quantify and measure the immediate performance benefit of it, do it.

Same as:

for (int i = 0; i <= 100; i++) {
  othernumber += i;
}

I cannot quantify the improvement, but I know that

othernumber += 5050;

is faster than the loop, no questions asked. (Although that’s a bad example, because hopefully the compiler would do it for me.)

In the upcoming blog posts I want to show you a few cases of such micro-optimizations in AEM, which I personally used with good success. Stay tuned.

(Photo by Michael Longmire on Unsplash)

Writing integration tests for AEM, part 5

This is a part of my ongoing series about writing integration tests with AEM.

Integration tests help you to keep control
Photo by Chris Leipelt on Unsplash

Writing tests seems to be a recurring topic 🙂 This week I wrote some integration tests which included one of the most important workflows in AEM: the activation of pages. Until now I haven’t blogged about handling both author and publish in an integration test, so I will show you how to do it.

So let’s assume that you want to do some product testing and validate that replication is working and also writes correct audit log entries. This should be covered with an integration test. You can find the complete source code in the ActivatePageIT in the integrationtests github project.

Before we dig into the code itself, a small hint for the development phase of tests: if you want to execute only a single integration test, you can instruct Maven to do this with the parameter “-Dit.test=<name of the test class>”. So in our case the complete Maven command line looks like this:

mvn clean install -Peaas-local -Dit.test=ActivatePageIT -Dit.author.url=http://localhost:4502

(assuming that your AEM author runs on the same port as mine … if you want to change that, modify the it.author.url parameter or the defaults in the pom.xml).

On the coding side, the approach follows the pattern of every integration test: we need to get the correct clients first:

As we want to use replication, we use a ReplicationClient, which is provided by the testing client library.
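What follows is only a rough sketch of that setup (the authoritative version is the ActivatePageIT linked above); it assumes the JUnit rules and the getAdminClient() helpers of the aem-testing-clients library:

import org.apache.sling.testing.clients.ClientException;
import org.junit.BeforeClass;
import org.junit.ClassRule;
import org.junit.Rule;

import com.adobe.cq.testing.client.CQClient;
import com.adobe.cq.testing.client.ReplicationClient;
import com.adobe.cq.testing.junit.rules.CQAuthorPublishClassRule;
import com.adobe.cq.testing.junit.rules.CQRule;

public class PageActivationIT { // stand-in name, the real class is ActivatePageIT

    // connects to the author and publish instances given on the command line
    @ClassRule
    public static final CQAuthorPublishClassRule cqBaseClassRule = new CQAuthorPublishClassRule();

    @Rule
    public CQRule cqBaseRule = new CQRule(cqBaseClassRule.authorRule, cqBaseClassRule.publishRule);

    static ReplicationClient replicationClient;
    static CQClient publishClient;

    @BeforeClass
    public static void setupClients() throws ClientException {
        // the ReplicationClient wraps the replication-related HTTP calls on the author
        replicationClient = cqBaseClassRule.authorRule.getAdminClient(ReplicationClient.class);
        publishClient = cqBaseClassRule.publishRule.getAdminClient(CQClient.class);
    }
}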

Next we define a custom Page class, which allows us to define the parentPath:
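Roughly like this (again just a sketch; the constructor signature and the overridden getParentPath() method are assumptions about the Page rule of the aem-testing-clients, check the real ActivatePageIT for the authoritative version):

import java.util.function.Supplier;

import org.apache.sling.testing.clients.SlingClient;

import com.adobe.cq.testing.junit.rules.Page;

// a Page rule whose test page is created below a configurable parent path
public class PageWithParent extends Page {

    private final String parentPath;

    public PageWithParent(Supplier<SlingClient> clientSupplier, String parentPath) {
        super(clientSupplier);
        this.parentPath = parentPath;
    }

    @Override
    public String getParentPath() {
        return parentPath;
    }
}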

Then the actual test case is straightforward.
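In essence it boils down to something like this (a sketch, not the verbatim test; replicationClient, publishClient and a page rule instance (here simply called page) are the objects set up above, and the timeouts are arbitrary):

@Test
public void activatePage() throws Exception {
    // trigger the activation of the test page on the author instance
    replicationClient.activate(page.getPath());

    // the page should eventually become available on the publish instance
    publishClient.waitExists(page.getPath(), 30000, 500);

    // ... and then fetch the audit log entries via doGetJson() and assert on them
}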

I used some more features of the testing clients to just test the existence or absence of the page, plus the doGetJson() method to get the JSON representation of the pages (in the getAuditEntries() method).

So, writing integration tests with this tooling at hand is easy and actually fun. Especially if the test code is as straightforward to implement as here.

AEM as a Cloud Service and the handling of binaries

When you are a long-time user of AEM 6.x (and even CQ5), you are probably familiar with the Asset Update workflow. Its primary task is the extraction of metadata from the binary asset and the creation of (smaller) renditions of it. This workflow is normally executed on the AEM authoring instance.

“Never underestimate the bandwidth …!” (symbolic photo)
Photo by Massimo Botturi on Unsplash

But since the beginning, this approach has been plagued with problems:

  • The question of supported file types. Given the almost unlimited number of file formats and their often proprietary implementations, it’s not always possible to perform these operations. In many cases, the support for these file types within Java is poor.
  • Additionally, depending on the size and type of the asset and the quality of the library which provides support for this file type, the processing can be very time consuming and also consume a lot of heap. Imagine that you want to create renditions of a TIFF file with dimensions of 10k x 10k pixels: assuming 24 bit color depth, this requires 300 megabytes of contiguous heap to store an uncompressed version of it. You have to size the heap accordingly, otherwise you will run out of memory (OOM).
  • To avoid these issues, external tools like ImageMagick were used for many file types; they come with support for various image formats (in many cases much better than the Java image libraries), plus the ability not to blow up the AEM process when processing fails (because ImageMagick runs in a dedicated process). But the capabilities of ImageMagick are limited as well, and the support for more exotic (non-image) file types could be better.
  • In all cases you need to size your hardware for a worst case scenario. For example you need to provision a lot of heap, if your authors might start to ingest large images. And you need to provision enough CPU to mitigate negative impacts on all other operations.
  • Another big problem is latency. Assuming that your asset is very large (it’s not uncommon to have assets larger than 1 gigabyte), it takes time to copy the binary from the (remote) datastore to a location where the processing takes place. Even if you can transfer 100 MiB per second, it takes 10 seconds to get the file onto the local disk; normally this process runs through the AEM JVM, which is problematic in terms of heap usage and can also cause performance problems. Not to mention code which is not aware of the possible sizes and tries to load the complete stream into memory.

In AEM as a Cloud Service this work is offloaded, and that’s what AssetCompute is for. It performs all these steps on its own, not using ImageMagick for image handling but high-quality, optimized routines which also power other Adobe products.

But what does that mean for you as a developer for AEM as a Cloud Service? In the first place, it does not have any impact. But you should learn a few things from it:

  • Do not create any renditions on your own; use AssetCompute instead. This service is extensible (check out Project Firefly), so you can do all kinds of asset operations there. There is no need anymore to use Java imaging library code.
  • Avoid streaming binary data through AEM. AEM as a Cloud Service itself (the JVM) should not be bothered with streaming binary data into and out of the JVM. If you want to upload files into AEM, you should use the aem-upload library.

In general, think twice before you open an InputStream in AEM (either via Rendition.getStream() or via the JCR API). Normally you never know how much data is behind it, and for almost all transformation cases it makes sense to use AssetCompute instead.
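And if you really have to move a binary through AEM, at least stream it in small chunks instead of materializing it in the heap. A minimal sketch (the helper class is made up for illustration):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import com.day.cq.dam.api.Rendition;

public class RenditionStreamer {

    // copies a rendition to the given output in 8 KB chunks; the heap usage stays
    // constant, no matter whether the asset is 1 MB or 1 GB large
    public static void copy(Rendition rendition, OutputStream out) throws IOException {
        try (InputStream in = rendition.getStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        // anti-pattern: IOUtils.toByteArray(rendition.getStream()) would pull the
        // complete (potentially huge) binary into memory
    }
}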

Long running sessions and SegmentNotFoundExceptions

If you search this blog, you find one recurring theme over the years: the lifecycle of JCR sessions and Sling ResourceResolvers. That you should not keep them open for a long time. And that you definitely have to close them. But I never gave you an example of what can happen if you don’t follow this recommendation. Until now.

These days I learned what an actual problem arising from it can look like. And the problem is called “SegmentNotFoundException”.

In the past a SegmentNotFoundException was a clear indication of a corrupt JCR repository. The recommendation was always either to fix it or to restore from backup. Both operations are tedious, require downtime and possibly also mean a loss of data. That’s probably also the reason why this specific exception is often taken as a sign of such a repository corruption. So let’s look at it systematically.

The root cause

With AEM 6.4 the feature of “tail-compaction” was introduced, which is a version of the online compaction feature. It is less efficient but takes less time than the full compaction. By default in AEM the tail compaction runs daily and the online compaction once a week.

But from what I understood, this tail compaction has a problem with long-running sessions: it can happen that tar files are compacted and removed which are still referenced. That means that it’s not really an on-disk corruption which needs to be fixed, but rather that some “old sessions” (read about MVCC in the previous post) are referencing data which is no longer there.

An unclosed session – a symbol photo (by engin akyurt on Unsplash)

Validate the symptoms

The problem I describe in this post happens under some special circumstances, which you should check first before you start the hunt for long-running sessions:

  • You get SegmentNotFoundExceptions (always with the same segment ID).
  • A repository check doesn’t find any inconsistency.
  • If you restart the instance, the error is gone, but appears again after some time (mostly at least a day).
  • You are running AEM 6.4 or AEM 6.5 (SP doesn’t seem to matter).

In the case I observed, only a single workflow step was affected, but not all the time and only after some time, which made me believe that it was related to the compaction. But it was very hard to track down the error, because the workflow step itself was complex, but safe.

The solution

Fix any long-running session in your application (unless you are registering an ObservationListener in there, which takes care of the refreshes by design). Really all of them. Use the JMX web console plugin and check the list of registered session MBeans every day on a production instance. Count them. Look at the timestamps when the sessions were opened.
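The pattern to aim for is simple: short-lived sessions which are reliably closed. A minimal sketch, assuming a service user mapping for the made-up sub-service name “my-service”:

import java.util.Collections;
import java.util.Map;

import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(service = ProfileReader.class)
public class ProfileReader {

    @Reference
    private ResourceResolverFactory resolverFactory;

    public void readProfile() throws LoginException {
        Map<String, Object> authInfo = Collections.singletonMap(
                ResourceResolverFactory.SUBSERVICE, "my-service"); // made-up sub-service name

        // try-with-resources: the resolver (and its underlying JCR session)
        // is closed as soon as this block is left
        try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(authInfo)) {
            Resource profile = resolver.getResource("/home/users/myproject"); // made-up path
            // ... read (or modify and resolver.commit()) the data here
        }
    }
}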

In the case I observed, the long-running session was in a different area of the application, but was working on the same data (user profiles) as the failing workflow step. The two areas of the code were totally unrelated to each other, so this was the only way to track it down.

Final words

Some other notes, which I consider as important in this context:

  • When you encounter a SegmentNotFoundException, please always open a support ticket, just in case. If it’s a different issue than described here, it’s better if you have that ticket open already.
  • If you see exactly this issue, and changing your application code makes the problem go away, please also raise a support ticket. That bug should get fixed (even if long-running sessions have not been recommended for years).
  • As mentioned, when you encounter this issue, it’s not a persisted corruption. Restarting will make the issue disappear for some time, but that should only buy you time to identify and fix the long-running sessions.
  • And AEM as a Cloud Service is not affected by this problem, because neither online compaction nor tail compaction is used there. Instead the Golden Master is compacted offline before cloning.

Long running sessions and clustering

In the last blog post I briefly talked about the basics of what to consider when you are writing cluster-aware code. The essence is to be aware of your write activities, and make sure that scheduled activities are running only on a single cluster node and not on many or all of them.

Today’s focus is on the behavior of JCR sessions with respect to clustering. From a conceptual point of view there is hardly a difference to a single-node cluster (or standalone instance), but the presence of more cluster nodes adds a new angle of potential problems to it.

When I talk about JCR, I am thinking of the Apache Oak implementation, which is built on top of the MVCC pattern. (The previous Jackrabbit implementation uses a different approach, so this whole blog post does not apply there.) The basic principle of MVCC is that each session is clearly separated from any other session which is open in parallel. Also, any changes performed in a session are not visible to other sessions unless

  • the other session is invoking session.refresh() or
  • the other session is opened after the mentioned session is closed.

This behavior applies to all sessions of a JCR repository, no matter if they are opened on the same cluster node or not. The following diagram visualizes this.

Diagram showing how 2 sessions perform changes to the repository without seeing the changes of the other as long as they don’t use session.refresh()

We have 2 sessions A1 and B1 which are initiated at the same time t0, and which perform changes independently of each other on the repository, so session B1 cannot see the changes performed with A1_1 (and vice versa). At time t1 session A1 is refreshed, and now it can see the changes B1_1 and B1_2. And afterwards B1 is refreshed as well, and can now see the changes A1_1 and A1_2 as well.

But if a session is not refreshed (or closed and a new session is used), it will never see the changes which happened on the repository after the session has been opened.

As said before, these sessions do not need to run on 2 separate cluster nodes; you get the same behavior on a single cluster node as well. But I mentioned that multiple cluster nodes are a special problem here. Why is that the case?

The problem is OSGi services in the background which perform a certain job and write data to the JCR repository. In a single-node cluster this is not a problem, because all of these activities go through that single service; and if that service uses a long-running JCR session for it, that will never be a problem, because this service is responsible for all changes and it can read and write all the relevant data. In a cluster with more than one node, each cluster node might have that service running, and the invocations of the service might be random. As in the diagram above, on cluster node A the data A1_1 is written, and on cluster node B the data point B1_1 is written. But they don’t see each other’s changes if they don’t refresh the session! And in most applications, which are written for single-node AEM instances, session.refresh() is barely used, because in such situations there’s simply no need for it, as this problem never occurred.

So when you are migrating your application to AEM as a Cloud Service, review your applications and make sure that you find all long-running ResourceResolvers and JCR sessions. The best option is to remove these long-running sessions and replace them with short-lived ones, which are closed when the job is done. The second-best option is to introduce a session.refresh(), so the session sees any updates which happened to the repository in the meantime. (And by the way: if you register an ObservationListener in that session, you don’t need a manual refresh, as this refresh is done for the ObservationListener anyway; what would it be for if not for reporting changes to the repository which happen after opening the session?)
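As an illustration, here is a minimal sketch of that second-best option (class and path are made up):

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

public class RefreshBeforeWork {

    // second-best option: a long-lived resolver which is refreshed before every
    // unit of work, so it sees the changes written in the meantime, including
    // those from other cluster nodes (the JCR equivalent is session.refresh(true))
    public void runJob(ResourceResolver longLivedResolver) {
        longLivedResolver.refresh(); // drop the stale MVCC snapshot
        Resource profile = longLivedResolver.getResource("/home/users/myproject"); // made-up path
        // ... do the actual work on the now up-to-date view
    }
}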

That’s all right now regarding cluster-aware coding. But I am sure that there is more to come 🙂

Cluster aware coding in AEM

With AEM as a Cloud Service quite a number of small things have changed; among other things you also get real clustering support in the authoring environment. Which is nice, because it gives you downtime-less authoring during deployments.

But this cluster also comes with a few gotchas, and one of them is that your application code needs to be cluster-aware. But what does that mean? What consequences does it have and what code do you have to change if you have never paid attention to this aspect?

The most important aspect is to do “every change only once”. It doesn’t make sense that 2 cluster nodes are importing the same set of data. A special version of this aspect is “avoid concurrent writes to the same node”, which can happen when a scheduled job is kicked off at the same time on all cluster nodes and this job tries to change something in the repository. In this case you don’t only have overhead, but very likely also a lot of exceptions.

And there is a similar aspect which you should pay attention to: connections to external systems. If you have a cluster running the same code and configuration, it’s not always wanted that each cluster node reaches out to that external system. Maybe you need to update it with the latest content only once, because the update triggers some expensive processing on their side, and you don’t want to have that triggered two or three times, probably pretty much at the same time.

I have shown you 2 cases where a clustered application can behave differently from a single-node environment; now let me show you how you can make your application cluster-aware.

Scheduled jobs

Scheduled jobs are a classic tool to execute certain jobs at a certain time. Of course we could use the Sling Scheduler directly, but to make the execution more robust, you should wrap it into a Scheduled Sling Job.

See the Sling Jobs website for the documentation and some examples (although the Javadocs are missing the ScheduleBuilder class, but here’s the code). And of course you should check out Kaushal Mall’s post with even more examples.

Jobs give you the guarantee that this job is going to be executed at least once.
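A rough sketch of such a scheduled Sling Job (the topic name and the schedule are made up; see the linked documentation for the complete API):

import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.JobManager;
import org.apache.sling.event.jobs.consumer.JobConsumer;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(service = JobConsumer.class,
        property = { JobConsumer.PROPERTY_TOPICS + "=" + NightlyImportJob.TOPIC })
public class NightlyImportJob implements JobConsumer {

    static final String TOPIC = "com/example/jobs/nightlyimport"; // made-up topic

    @Reference
    private JobManager jobManager;

    @Activate
    protected void activate() {
        // schedule the job; the job engine makes sure every execution happens on
        // exactly one cluster node (in real code check getScheduledJobs() first,
        // so you don't add the same schedule twice)
        jobManager.createJob(TOPIC)
                .schedule()
                .daily(2, 0) // every night at 02:00
                .add();
    }

    @Override
    public JobResult process(Job job) {
        // ... do the actual import work here
        return JobResult.OK;
    }
}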

Use the Sling Scheduler only for very frequent jobs (e.g. once every 5 minutes), where it doesn’t matter if one execution is skipped, e.g. because the instance was just restarting. To limit the execution of such a job to a single node, you can annotate the job runner with this annotation:

@Property (name="scheduler.runOn", value="SINGLE")

(see the docs)
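With the current OSGi DS annotations the equivalent registration looks roughly like this (a sketch; the cron expression is just an example):

import org.osgi.service.component.annotations.Component;

// registered for the Sling Commons Scheduler via the whiteboard pattern;
// "scheduler.runOn=SINGLE" restricts the execution to a single cluster node
@Component(service = Runnable.class,
        property = {
                "scheduler.expression=0 0/5 * * * ?", // every 5 minutes
                "scheduler.runOn=SINGLE"
        })
public class FrequentTask implements Runnable {

    @Override
    public void run() {
        // ... the actual (short) task
    }
}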

What about caches?

In-memory caches are often used to speed up operations. Most often they contain the results of previous operations which are then reused; cache elements are either actively purged or expire using a time-to-live.

Normally such caches are not affected by clustering. They might contain different items with potentially different values on the individual cluster nodes, but that must never be a problem. If it is a problem, you have to look for a different approach, e.g. persisting the data to the repository (if it is not already coming from there) or externalizing the cache (e.g. to a Redis or memcached instance).
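For illustration, such a cache can be as simple as this (sketched here with Guava’s CacheBuilder; any comparable cache implementation, or a plain map with timestamps, works just as well):

import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class RenderedFragmentCache {

    // every cluster node keeps its own copy; entries simply expire after 5 minutes
    private final Cache<String, String> cache = CacheBuilder.newBuilder()
            .maximumSize(1000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .build();

    public void put(String key, String value) {
        cache.put(key, value);
    }

    public String get(String key) {
        return cache.getIfPresent(key); // may return null on a cache miss
    }
}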

Also, having a simpler application instead of the highest cache-hit ratio possible is often a good trade-off.

Ok, these were the topics I wanted to discuss here. But expect a blog post about one of my favorite topics: “Long running sessions and clustering”.