AEM micro-optimizations (part 3)

Welcome to my third post on AEM micro-optimizations. Once again I have some interesting ways to improve the performance of your AEM application, sometimes with small improvements, but sometimes with significant ones.

During some recent performance optimization work I came across code which felt a bit odd. Technically it was quite simple:

for (Item item : manyItems) {
  processSingleItem(resolver, item);
}

void processSingleItem(ResourceResolver resolver, Item item) throws PersistenceException {
  // do something with the resourceResolver
  resolver.commit(); // one commit per item
}

That is indeed a very common pattern, especially in software, which evolved over time: You have code, which deals with a single item. And later, if you need to do it for multiple items, you execute this code in a loop. Works perfectly, and the pattern is widely used.

And it can be problematic.

Suppose the processSingleItem() method contains an operation which comes with some overhead. Maybe you are not aware of that overhead, so it goes unnoticed. Maybe you expect that if processSingleItem() takes 5 ms for one item, requiring 50 ms for 10 items is ok. Well, an O(n) algorithm isn’t too bad, is it?

But what if I tell you that the static overhead of that method is so large that handing it 10 items instead of just one will increase its runtime not by a factor of 10, but only by a factor of 1.1?
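To make that factor of 1.1 tangible, here is a tiny cost model. It is only a sketch: the class name OverheadModel and the numbers (89 ms of static overhead, 1 ms of real work per item) are made up for illustration and are not measurements from AEM.

```java
/** Illustrative cost model; all numbers used with it are assumptions, not measurements. */
final class OverheadModel {

    /** One call per item: the static overhead is paid n times. */
    static long perItemCalls(int n, long overheadMs, long workPerItemMs) {
        return n * (overheadMs + workPerItemMs);
    }

    /** One call for all items: the static overhead is paid only once. */
    static long singleBatchedCall(int n, long overheadMs, long workPerItemMs) {
        return overheadMs + n * workPerItemMs;
    }
}
```

With these assumed numbers a single item costs 90 ms, ten items in one batched call cost 99 ms (a factor of 1.1), and ten separate calls cost 900 ms (a factor of 10).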

Imagine you need to go grocery shopping for your Sunday dinner. You get yourself ready, take the bike to the grocery store, get the potatoes you need. Pay, and get back home. Drop the potatoes there. Then again, taking the bike to the grocery store, getting some meat. Back home. Again to the grocery store, this time for paprika (grilled paprika are delicious …). And so on and so on, until you have everything you need for your barbecue on Sunday. You have now spent 6 hours, mostly on the bike and waiting at the counter.

Are you doing that? No, of course not. You drive once to the grocery store, get all the things, pack them onto your bike, and get home. Takes maybe 90 minutes. Paying the static overhead (cycling, waiting at the counter) just once saves a lot of time.

It’s the same in coding. You have static overhead (acquiring locks, getting database connections, network latency, calling through thick framework layers while just copying references to the data), which is not determined by the amount of data you process. But unlike in the example of grocery shopping it’s not directly visible where such a static overhead occurs, and unfortunately documentation rarely points that out.

Writing to the repository comes with such a static overhead; and it can be like a 20-minute ride to the grocery store. Saving 10 small batches definitely takes more time than saving once with a batch of 10 times the size. At least if you keep the size of the changeset limited; for details check this earlier posting of mine.

Check this great presentation of Georg Henzler at adaptTo() 2019 (starting at 17:00min ) (slides) for some benchmark data, how the size of the changeset influences the time to save (spoiler: for realistic sizes it does not really increase).

So I changed the above code to something like this:

for (Item item : manyItems) {
  processSingleItem(resolver, item);
}
resolver.commit(); // a single commit for all items

void processSingleItem(ResourceResolver resolver, Item item) {
  // do something with the resourceResolver but no commit
}

Switching to this approach improved the performance for ~ 100 items by a factor of more than 10! And that’s an impressive number for such a minimal change.
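If the list of items can get very large, a single commit at the end conflicts with the advice to keep the changeset limited. A middle ground is to commit once per batch. The following is a minimal sketch under assumptions: BatchedCommit and processInBatches are hypothetical names, and the Runnable stands in for resolver.commit() (which throws a checked PersistenceException, so in real code you would have to wrap it).

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: commits once per batch instead of once per item. */
final class BatchedCommit {

    /**
     * Processes all items and invokes commit after every batchSize items,
     * plus once at the end for a partially filled last batch.
     * Returns the number of commits that were performed.
     */
    static <T> int processInBatches(List<T> items, int batchSize,
                                    Consumer<? super T> processor, Runnable commit) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int commits = 0;
        int pending = 0; // items processed since the last commit
        for (T item : items) {
            processor.accept(item);
            pending++;
            if (pending == batchSize) {
                commit.run();
                commits++;
                pending = 0;
            }
        }
        if (pending > 0) { // flush the remaining items
            commit.run();
            commits++;
        }
        return commits;
    }
}
```

With 10 items and a batch size of 4 this performs 3 commits instead of 10. Which batch size works well depends on your content and should be measured.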

So check your code for this specific coding pattern, find out if the parameters are good (that means small changes) and add some performance logging. And then convert to this batching mode and see what your numbers are doing.
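For the performance logging, something as simple as the following sketch is often enough to compare the numbers before and after the change. The class name Timed is hypothetical, and it prints to System.out only to stay self-contained; in AEM you would rather use your regular SLF4J logger.

```java
import java.util.function.Supplier;

/** Hypothetical helper which reports how long an action took. */
final class Timed {

    /** Runs the action, prints the elapsed wall-clock time and returns the action's result. */
    static <T> T call(String label, Supplier<T> action) {
        long start = System.nanoTime();
        try {
            return action.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(label + " took " + elapsedMs + " ms");
        }
    }
}
```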

Of course, very often this saving happens in the context of a much larger operation, and a 10 times improvement in this area will only speed up the larger operation from 12 seconds to 11 seconds. But hey, when you get this 1 second almost for free, just do it (and we are still talking about micro-optimizations). And nothing prevents you from taking a deeper look into what the system is doing in the remaining 11 seconds.

Leave me a comment if you have some interesting story to share, where such small changes resulted in big improvements.

AEM micro-optimization (part 2)

Micro-optimizations are important, and their importance is described by an LWN posting about the Linux kernel:

Most users are unlikely to notice any amazing speed improvements resulting from these changes. But they are an important part of the ongoing effort to optimize the kernel’s behavior wherever possible; a long list of changes like this is the reason why Linux performs as well as it does.

And this is not specific to the Linux kernel; you can apply the same strategy to every piece of software. The very same applies to AEM, which is a complex (and admittedly sometimes really slow) beast.

There are a number of cases in AEM where you do not operate on a single object (page, asset, resource, node), but apply the same operation to many of these objects.

The naive approach of just iterating the list and executing the operation on one element at a time can be quite inefficient, especially if this operation comes with a static overhead.

Some examples:

  • For replication there are some pre-checks, then the creation of the package, the creation of the sling jobs (or sending the package to the pipeline when running on AEM as a Cloud Service), the update of the replication status, and the writing of the audit log entries.
  • When determining the replication status of a page, the replication queues need to be checked to see if this page is still subject to a pending replication, which can get slow when the queues are full.
  • Committing changes to the JCR repository; there is a certain overhead in it (validating all changes, committing them to permanent storage, invoking the synchronous listeners, locking etc).

And in many cases these bottlenecks have been known for a while, and there is an API which allows you to perform the action in batch mode for a multitude of elements:

(The ReplicationStatusProvider was introduced some years back when we had to deal with large workflow packages being replicated, which resulted in a lot of traversals of the replication queue entries. Adding this optimized version improved the performance by at least a factor of 10; so even in less intense operations I expect an improvement.)

So if you have a hand-crafted loop to execute a certain activity on many elements, check if a more efficient batch API is available. There’s a good chance that it is already there.

If you have more cases where a batch mode should be available, but it isn’t, leave a comment here. I am happy to help, either by finding the right API or by potentially kickstarting a product improvement.

AEM as a Cloud Service and the handling of binaries

If you are a long-time user of AEM 6.x (or even CQ5), you are probably familiar with the Asset Update workflow. Its primary task is the extraction of metadata from the binary asset and the creation of (smaller) renditions for it. This workflow is normally executed on the AEM authoring instance.

“Never underestimate the bandwidth …!” (symbolic photo)
Photo by Massimo Botturi on Unsplash

But since its beginning this approach has been plagued with problems:

  • The question of supported filetypes. Given the almost unlimited amount of file formats and their often proprietary implementation, it’s not always possible to perform these operations. In many cases, the support of these file types within Java is poor.
  • Additionally, depending on the size and the type of the asset and the quality of the library which provides support for this filetype, the processing can be very time consuming and also consume a lot of heap. Imagine that you want to create renditions of a TIFF file which has dimensions of 10k * 10k pixels: assuming a color depth of 24 bits, this requires 300 megabytes of contiguous heap to store an uncompressed version of it. You have to size the heap accordingly, otherwise you will run out of memory (OOM).
  • To avoid these issues, for many filetypes external tools like imagemagick were used, which come with support for various image types (in many cases much better than the Java Image library), plus the ability not to take down the AEM process when the processing fails (because imagemagick runs in a dedicated process). But the capabilities of imagemagick are also limited, and the support for more exotic (non-image) file types could be better.
  • In all cases you need to size your hardware for a worst case scenario. For example you need to provision a lot of heap, if your authors might start to ingest large images. And you need to provision enough CPU to mitigate negative impacts on all other operations.
  • Another big problem is the latency. Assuming that your asset is very large (it’s not uncommon to have assets larger than 1 Gigabyte), it takes time to copy the binary from the (remote) datastore to a location where the processing takes place. Even if you can transfer 100 MiB per second, it needs 10 seconds to have the file transferred to the local disk; normally this process runs through the AEM JVM, which is problematic in terms of heap usage, and also can cause performance problems. Not to mention code, which is not aware of the possible sizes and tries to load the complete stream into memory.

In AEM as a Cloud Service this is offloaded, and that’s what AssetCompute is for. It performs all these steps on its own; also not using imagemagick for image handling, but high quality and optimized routines which also power other Adobe products.

But what does that mean for you as developer for AEM as a Cloud Service? In the first place, it does not have any impact. But you should learn a few things from it:

  • Do not create any renditions on your own, use AssetCompute instead. This service is extensible (check out Project Firefly), so you can do all kinds of asset operations there. There is no need anymore to use the Java image library code.
  • Avoid streaming binary data through AEM. AEM as a Cloud Service itself (the JVM) should not be bothered with streaming binary data into and out of the JVM. If you want to upload files into AEM, you should use the aem-upload library.

In general, think twice before you open an InputStream in AEM (either via Rendition.getStream() or also via the JCR API). Normally you never know how much data is behind it, and for almost all transformation cases it makes sense to use AssetCompute to perform these.
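If you really cannot avoid handling a stream yourself, at least never read it completely into memory. The following sketch (BoundedCopy is a hypothetical name, and it uses only plain java.io, nothing AEM specific) copies a stream through a fixed-size buffer, so the heap usage stays the same no matter how large the binary behind the stream is.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: copies a stream of unknown size with constant memory usage. */
final class BoundedCopy {

    /** Copies everything from in to out through a buffer of bufferSize bytes; returns the bytes copied. */
    static long copy(InputStream in, OutputStream out, int bufferSize) {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must be >= 1");
        }
        byte[] buffer = new byte[bufferSize]; // the only allocation, independent of the stream size
        long total = 0;
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

The caller stays responsible for closing both streams, ideally with try-with-resources.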

Writing unittests for AEM (part 4): OSGI services mock services

In the last parts of this small series (part 1, part 2, part 3) I covered some basic approaches how you can use the Sling and AEM mocking libraries to ease writing unittests. The examples were quite basic and focussed, but in reality many test cases turn out to be much more complex.

And especially when your code has dependencies on other OSGI services, tests can get tricky. So today I want to walk you through a unittest I wrote some time ago; it’s a unittest for the EnsureOakIndex functionality (EnsureOakIndexJobHandlerTest).

The interesting part is that the required EnsureOakIndex service references 4 other services in total; if they are not present, my EnsureOakIndex service will never start properly. Thus you have to fulfill all service requirements of an OSGI service in the unittest as well (at least if you want to use SlingContext like I do here).

The easiest way to solve this is to rely on predefined services which are part of the SlingMocks or AemMocks. The second best way is to create simple mocks and register them as a service, so the dependency is fulfilled. That’s definitely a convenient way if your tests do not invoke any of the service methods at all.

Thus the setup() method of my unittests is often pretty large, because there I prepare and inject all other services which I need to make my software-under-test work.

And because this setup works quite well and reliably, I always use AemContext for my unittests (or SlingContext, but as I have not yet observed any difference in test execution time, I often prefer just AemContext because it comes with some more services). Only if I need neither resources nor nodes nor OSGI do I stick with plain junit. For everything else AemContext removes the necessity for a lot of mocking.

Optimizing Sling Models (updated)

A few days ago I found an interesting blog post at https://sourcedcode.com/blog/aem/aem-sling-model-field-injection-vs-constructor-injection-memory-consumption, which makes the claim that constructor injection with Sling Models is much more memory efficient than the “standard” field-based injection. The claim is that the constructor injection approach “saves 1800% in bytes” (152 bytes vs 8 bytes in the example).

Well, that result is not correct, because the example implementations of the SlingModels used there are not identical. In the case of field-based injection the references are available during the complete lifetime of that SlingModel, not just during the @PostConstruct method call, and thus these references consume memory.

In the example of constructor-based injection, on the other hand, the references are just available during the constructor call; they are not available in any other method. If you want to achieve the same behavior as in the field-injection example, you have to store the references in fields, and then the memory consumption of that SlingModel increases.

But Justin Edelson pointed out correctly, that you gain from constructor-based injection, if you need the references just in the constructor to compute some results (which are then stored in fields), and in no other method. That’s indeed a small optimization.

But let’s be honest: If we are talking about an additional memory overhead of 100 bytes per complex SlingModel, that’s a negligible number. It’s not typical that hundreds of these models are created per second. And even in that case, when they are created to render a page, the models are garbage collected right after the request is completed. It doesn’t matter if 100 bytes more or less are allocated and collected. Thus the overhead is normally not even measurable.

But well, you might hit the edge case, where this really makes a difference.

Update June 8th: I got informed that the referenced blog article has been updated. It now contains a more reasonable example which makes the sling models comparable. Basically it reflects now the optimization Justin already mentioned. And the difference in object size is now only 40 bytes vs 24 bytes.

Best practices for AEM unittests

Some time ago I already wrote some posts (1, 2, 3) about unit testing with AEM, especially in combination with SlingMocks / AEM Mocks.

In the last months I also spent quite some time in improving the unittests of ACS AEM Commons, mostly in the context of updating the Mockito framework from 1.9x to a more recent version (which is a pre-requisite to make the complete build working with Java 11). During that undertaking I reviewed a lot of unit tests which required adjustments; and I came across some patterns which I also find (often?) in AEM projects. I don’t think that these patterns are necessarily wrong, but they make tests hard to understand, hard to change and often these tests make production code overly complex.

I will list a few of these patterns, which I consider problematic. I won’t go that far and call them anti-patterns, but I will definitely look closely at every instance I come across.

Unittests don’t matter, only test coverage matters.
Sometimes I get the impression that the quality of the tests doesn’t matter, but only the resulting test coverage (as indicated by test coverage tools like jacoco), and that paying attention to the code quality of the tests and investing time into refactoring tests is considered wasted time. I beg to differ.
Although unit tests are not deployed into a production environment, the usual quality measures should be applied to unit tests as well, because that makes them easier to extend and to understand. And the worst which can happen to production code is that a bugfix is not developed in a TDD way (build a failing testcase first to prove your error is happening) because it is too much work to extend the existing tests.

Mocking Sling Resources and/or JCR nodes
With the presence of AEM Mocks there should not be any need to manually mock Sling Resources and JCR nodes. It’s a lot of work to do that, especially if you compare it to loading a JSON structure into an in-memory repository. Same with ResourceResolvers and JCR sessions. So don’t mock Sling resources and JCR nodes! That’s a case for AemMocks!

Using setters in services to set references
When you want to test services, the AEM Mock framework handles injections as well; you just need to use the default constructor of your service to instantiate it, and then pass it to the context.registerInjectActivateService() method. If required, create the referenced services as mocks beforehand and register them as well. AemMocks comes with ways to test OSGI services and components in a very natural way (including activations and injection of references), so please use it.
There is no need to use setter methods for the service references in the production code just for this usecase!

If you are looking for an example how these suggestions can be implemented, you can have a look at the example project I wrote last year.

Of course this list is far from being complete; if you have suggestions or more (anti-) patterns for unittests in the AEM area, please leave me a comment.

How to properly delete a page

A relevant aspect of any piece of content is its lifecycle, the process of creation, modification, use and finally deletion of that content. And although the deletion of a page in AEM sounds quite easy, there are quite a few aspects which need to be dealt with. For example:

  • Create a version of the page, so it can be restored.
  • Update the MSM structures (if required)
  • De-activate the page from publishing.
  • Create an entry in the audit.log

All this happens when you use one of the PageManager.delete() functions to remove the page. If you are not using them, the most obvious problem you’ll face afterwards is that you have published pages which you cannot delete anymore (because the page is missing on authoring), and you have to use a workaround for it.

So, please remember: The PageManager might have overhead in many areas, but there is a reason for it to exist. Taking care of all the activities mentioned above is one of them. So whenever you deal with pages (creating/moving/renaming/deleting), first check the PageManager API before you start using the JCR or Sling API.

Safe handling of ResourceResolvers

Just digging through my posts of the last years, I found that my last post on ResourceResolvers and JCR sessions is more than a year old. But unfortunately that does not mean that this aspect is widely understood; I still see a lot of improper use of them when I review project code as part of my job.

But instead of explaining again and again that you should never forget to close them, I want to introduce a different pattern, which helps you avoid the “old pattern” of opening and closing completely. It’s a pattern which encapsulates the opening and closing of a ResourceResolver, and your code is then executed as a Consumer or Function within. The ResourceResolver cannot leak, and you cannot do anything wrong. The only prerequisite is Java 8, but that should not be a problem in 2020.

// does not return anything
// resolverFactory is assumed to be a ResourceResolverFactory instance (e.g. injected via @Reference)
public void withResourceResolver(Map<String,Object> authenticationInfo, Consumer<ResourceResolver> consumer) {
   try (ResourceResolver resolver = resolverFactory.getResourceResolver(authenticationInfo)) {
     consumer.accept(resolver);
   } catch (Exception e) {
     LOGGER.error("Exception happened while opening ResourceResolver", e);
   }
}

The same is possible with a function to return a value:

// return a value from the lambda
// resolverFactory is assumed to be a ResourceResolverFactory instance (e.g. injected via @Reference)
public <T> T withResourceResolver(Map<String,Object> authenticationInfo, Function<ResourceResolver,T> function, T defaultValue) {
   try (ResourceResolver resolver = resolverFactory.getResourceResolver(authenticationInfo)) {
     return function.apply(resolver);
   } catch (Exception e) {
     LOGGER.error("Exception happened while opening ResourceResolver", e);
   }
   return defaultValue;
}

// convenience function, returns null in case of an exception
public <T> T withResourceResolver(Map<String,Object> authenticationInfo, Function<ResourceResolver,T> function) {
   return withResourceResolver(authenticationInfo, function, null);
}

So if you are not familiar with the functional style of Java 8, some small examples how to use these methods:

Map<String,Object> authenticationInfo = …
withResourceResolver(authenticationInfo, resolver -> {
   Resource res = resolver.getResource("/");
   // do something more useful, but return nothing 
});

// return a value from the lambda 
Map<String,Object> authenticationInfo = …
String result = withResourceResolver(authenticationInfo, resolver -> {
   Resource res = resolver.getResource("/");
   return res.getPath();
});

As you can easily see, you don’t need to deal with the lifecycle of ResourceResolvers anymore. And if your authenticationInfo map is always the same, you can even hardcode it within the withResourceResolver() methods, so the only remaining parameter is the consumer or the function.
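A sketch of that last idea, written against plain Java so that it can be run without AEM: the hypothetical ResourceTemplate class gets the "how to open the resource" part once, and callers only pass their lambda. In AEM the opener would be something like () -> resolverFactory.getResourceResolver(authenticationInfo); here any AutoCloseable works. The class and method names are my assumptions and not part of any Sling API.

```java
import java.util.concurrent.Callable;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical execute-around helper: opens a resource, runs the lambda, always closes the resource. */
final class ResourceTemplate<R extends AutoCloseable> {

    private final Callable<R> opener; // knows how to open the resource, e.g. with fixed authentication info

    ResourceTemplate(Callable<R> opener) {
        this.opener = opener;
    }

    /** Runs the function and returns its result, or defaultValue if anything fails. */
    <T> T call(Function<R, T> function, T defaultValue) {
        try (R resource = opener.call()) {
            return function.apply(resource);
        } catch (Exception e) {
            // in real code: log with your logger instead of System.err
            System.err.println("Exception while using the resource: " + e);
            return defaultValue;
        }
    }

    /** Variant for code which does not return anything. */
    void run(Consumer<R> consumer) {
        call(resource -> {
            consumer.accept(resource);
            return null;
        }, null);
    }
}
```

The resource is closed by the try-with-resources block in both the success and the failure case, so the calling code never sees an open resource outside of its lambda.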

Prevent workflow launchers from starting a workflow

Workflow launchers are the standard way to trigger workflows based on changes in the content repository. The most prominent workflow which is triggered that way is the “Asset Update Workflow”, which does all the heavy lifting regarding asset processing. And it’s important to note that this workflow is executed on all changes to an asset itself, its renditions or its metadata.

But often this is not required. If you add more or custom metadata to an asset, or even do it in a batch mode, you don’t want this workflow to run at all; these metadata changes are not relevant to the assets themselves, but just to the way they should be handled in the specific context of your application.

The typical way to prevent the workflow from starting is to disable the workflow launcher (setting the “enabled” flag to “false”). But this is a global setting which affects all possible invocations, including the regular ingestion, and in that case the workflow has to run. So you need a way to prevent the workflow from starting only for specific changes.

Fortunately there are a few ways to achieve that, if you control the code which performs the changes after which you don’t want the workflow to start. That control is key, because it lets you use a feature of the workflow launcher (sidenote: I just found that it has been documented; so it often makes sense to check the documentation for updates).

You can configure on the workflow launcher an exclusion property in the format “event-user-data:randomString”; this ignores all changes made by a JCR session which has a user-property “randomString” set.

How can you set that property? That’s quite easy:

Session session = ...;
session.getWorkspace().getObservationManager().setUserData("randomString");
// do your work with the session
session.save();

And by default the “Asset Update Workflow” is configured with “event-user-data:changedByWorkflowProcess”, so if your batch asset-operation sets the user-data to this string “changedByWorkflowProcess”, the “Asset Update Workflow” is not triggered anymore, without disabling the workflow launcher for it.

That’s it. And if you ever wanted to channel data from a saving session to the process which handles the observation events for it (the workflow launchers are just a very convenient way around the JCR Observation API): Just use event.getUserData().

How to use Runmodes correctly (update)

Runmodes are an essential concept within AEM; they form the main and only way to assign roles to AEM instances. The primary usecase is to distinguish between the author and publish role, and another common usecase is to distinguish between PROD, Staging and Development environments. Technically it’s just a set of strings which are assigned to an instance, and which are used by the Sling framework on a few occasions, the most prominent being the Sling JCR Installer (which handles the /apps/myapp/config, /apps/myapp/config.author, etc. directories).

But I see other usecases; usecases where the runmodes are fetched and compared against hardcoded strings. A typical example for it:

boolean isAuthor() {
  return slingSettingsService.getRunModes().contains("author");
}

From a technical point of view this is fully correct, and works as expected. The problem arises when some code is based on the result of this method:

if (isAuthor()) {
// do something
}

Because now the execution of this code is hardcoded to the author environment; which can get problematic, if this code must not be executed on the DEV authoring instances (e.g. because it sends email notifications). It is not a problem to change this to:

if (isAuthor() && !isDevelopmentEnvironment()) {
// do something
}

But now it is hardcoded again 😦

The better way is to rely solely on the OSGI framework. Just make your OSGI components require a configuration and provide that configuration only for the runmodes which need it.

@Component(configurationPolicy = ConfigurationPolicy.REQUIRE)
public class MyServiceImpl implements MyService {
  // ...
}

This case requires NO CODING at all, instead you can just use the functionality provided by Sling. And this component does not even activate if the configuration is not present!

Long story short: Whenever you see a reference to SlingSettingsService.getRunModes(), it’s very likely used wrongly. And we can generalize it to “If you add a reference to the SlingSettingsService, you are doing something wrong”.

There are only a very few cases where the information provided by this service is actually useful for non-framework purposes. But I bet you are not writing a framework 🙂

Update (Oct 17, 2019): In a Twitter discussion Ahmed Musallam and Justin Edelson pointed out that there are usecases where this is actually useful and the right API to use. Possibly yes, I cannot argue about that, but these are the few cases I mentioned above. I have never encountered them personally. And as a general rule of thumb it’s still applicable. Because every rule has its exceptions.

You think that I have written on that topic already? Yes, I did, actually 2 times already (here and here). But it seems that only repetition helps to get this message through. I still find this pattern in too many codebases.