Writing backwards compatible software

At last week’s adaptTo() conference I discussed the topic of AEM content migration with a few attendees, and why it’s a tough topic. I learned that the majority of these ad-hoc migrations are done because they are mandated by changes in the components themselves, and therefore the content has to be migrated to adhere to the new expectations of the component. My remark “can’t you write your components in a way that they are backwards compatible?” was not that well received … it seems that this is a hard topic for many.

And yes, writing backwards compatible components is not easy, because it comes with a few prerequisites:

  • The awareness that you are making a change which breaks compatibility with existing software and content. While breakages in “software” can be detected easily, the oftentimes very loose contract between code and content is much harder to enforce. With some experience in that area you will develop a feeling for it, but especially less experienced folks can make such changes inadvertently, and you will detect the problem way too late.
  • You need a strategy for how to handle such a situation. While the AEM WCM Core Components introduced a versioning model which seems to work quite nicely, an existing codebase might not be prepared for this. It forces more structure and more thought into how you design your codebase, especially when it comes to Sling Models and OSGI services, and where to put logic so you don’t duplicate it.
  • And even if you are prepared for this situation, it’s not free: you will end up with new versions of components which you need to maintain. Just breaking compatibility is much easier, because then you still have just one component.

So I totally get it if you don’t care about backwards compatibility at all, because in the end you are the only consumer of your code and you can control everything. You are not a product developer, for whom backwards compatibility needs to have a much higher priority.

But backwards compatibility gives you one massive benefit, which I consider quite important: it gives you the flexibility to perform a migration at a time which is a good fit. You don’t need to perform this migration before, in the midst of, or immediately after a deployment. You deploy the necessary code, and then migrate the content when it’s convenient. And if that migration date is pushed further out for whatever reason, it’s not a problem at all, because backwards compatibility allows you to decouple the technical aspect (the deployment) from the actual execution of the content migration. And for that you don’t need to re-scope deployments and schedules.

So maybe it’s just me, with the hat of a product developer on, who is so focused on backwards compatibility. And maybe in the wild the cost of backwards compatibility is much higher than the flexibility it allows. I don’t know. Leave me a comment if you want to share your opinion.

Why I would deprecate InjectionStrategy.OPTIONAL for Sling Models

Sling Models offer a very convenient abstraction, as they allow data from the repository to be mapped into fields of Java POJO classes. One feature I see used often is the optional injection strategy. By default, if an injection does not work, the instantiation of the POJO fails. When InjectionStrategy.OPTIONAL is set in the model annotation (see the Sling docs), such a non-working injection will not fail the creation of the model; instead the field is left at the default value of the respective type, which is null for Strings and other complex types. And this setting is valid for the entire class, so when you want to write reliable code, you would have to assume that every injected String property could be null.
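To illustrate the effect, here is a minimal sketch (class and property names are made up) using the model-level defaultInjectionStrategy attribute, which is the typical way this shows up in project code:

import org.apache.sling.api.resource.Resource;
import org.apache.sling.models.annotations.DefaultInjectionStrategy;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.ValueMapValue;

@Model(adaptables = Resource.class,
       defaultInjectionStrategy = DefaultInjectionStrategy.OPTIONAL)
public class TeaserModel {

  // if the resource has no "jcr:title" property, this field simply stays null
  @ValueMapValue
  private String title;

  public String getTitle() {
    return title; // every caller must be prepared to handle null here
  }
}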

This comes with a few challenges, because now you can’t rely on values being non-null anymore, but would need to check each field for whether a proper value has been provided. This is sometimes done, but in the majority of cases it is just assumed that the field is non-null.

I wonder why this is done at all. Normally you write your components in a way that the necessary properties are always available. And if you operate with defaults, you can guarantee in several ways that they are available as soon as the component is created and authored for the very first time. While in a few cases a missing property must be dealt with for whatever reason, it is never justified to treat all property injections as optional. That would mean that this Sling Model is supposed to make sense of almost any resource it is adapted from. And that won’t work.

And if a property is really optional: some time back we added the feature to use something like this (if you really can’t provide a default value, which would be a much better choice):

@ValueMapValue
Optional<String> textToDisplay;  // a missing property results in Optional.empty()

With this you can express the optionality of the value through the Java type system, and then it’s quite unlikely that you miss the validation.

But if it were just up to me, I would deprecate InjectionStrategy.OPTIONAL and ban it, because it’s one of the most frequent reasons for NullPointerExceptions in AEM.

I know that using InjectionStrategy.OPTIONAL saves you from asking yourself “is this property always present?”, but that’s a very poor excuse. With just a few more seconds of work you can make your Sling Model more robust by simply providing default values for every injected field (a small sketch follows after the list below). So please:

  • Avoid using optional injections when possible!
  • When it’s required, use the Optional type to express it!
  • Don’t use InjectionStrategy.OPTIONAL!
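
For illustration, here is a minimal sketch of both recommendations (class and property names are made up): a default value for a property that should always have content, and the Optional type for a property that is genuinely optional.

import java.util.Optional;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.models.annotations.Default;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.ValueMapValue;

@Model(adaptables = Resource.class)
public class TextModel {

  // falls back to a sensible default instead of failing or being null
  @ValueMapValue
  @Default(values = "Untitled")
  private String title;

  // genuinely optional: the absence is visible in the type system
  @ValueMapValue
  private Optional<String> subtitle;
}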

Using “optional” (in all cases) can also come with a performance impact when used with the generic @Inject annotation; for that read my earlier blog posts on the performance of Sling Models: Sling Model Performance.

How not to do content migrations

(Note: This post is not about getting content from environment A to B or from your AEM 6.5 to AEM CS.)

The requirements towards content and component structure evolve over time; the components you initially started with might not be sufficient anymore. For that reason the components will evolve: they need new properties, or components need to be added/removed/merged, and that must be reflected in the content as well. This is possible to do manually, but it takes too much work and is too error-prone. Automation to the rescue.

I have already come across a few of those “automated content migrations”, and I have found a few patterns which don’t work. But before I start with them, let me briefly cover the one pattern which works very well.

The working approach

The only working approach is a workflow which is invoked on small-ish subtrees of your content. It silently skips content which does not need to be migrated, and reports every piece of content it migrated. It might even have a dry-run mode which just reports everything it would change. This approach has a few advantages:

  • It is invoked intentionally, on author only, and only operates on a single, well-defined subtree of content. It logs all changes it makes.
  • It does not automatically activate every change it has made, but requires activation as a dedicated second step. This allows you to validate the changes and only then activate them.
  • If it fails, it can be invoked repeatedly on the same content and continue from where it left off.
  • It’s a workflow, with the guarantees of a workflow. It cannot time out as a request can, but will complete eventually. You can either log the migration output or store it as dedicated content/node/binary data somewhere. You know when a subtree is migrated, and you can prove that it’s complete.

Of course this is not something you can simply do ad hoc; it requires some planning in the design, the coding and the execution of the content migration.
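
To make the shape of such a workflow process a bit more concrete, here is a rough sketch (component names, the dry-run flag and the actual migration logic are placeholders, and error handling is reduced to a minimum):

import com.adobe.granite.workflow.WorkflowException;
import com.adobe.granite.workflow.WorkflowSession;
import com.adobe.granite.workflow.exec.WorkItem;
import com.adobe.granite.workflow.exec.WorkflowProcess;
import com.adobe.granite.workflow.metadata.MetaDataMap;
import org.apache.sling.api.resource.ModifiableValueMap;
import org.apache.sling.api.resource.PersistenceException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.osgi.service.component.annotations.Component;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Component(service = WorkflowProcess.class,
           property = {"process.label=Migrate teaser components (sketch)"})
public class TeaserMigrationProcess implements WorkflowProcess {

  private static final Logger LOG = LoggerFactory.getLogger(TeaserMigrationProcess.class);

  @Override
  public void execute(WorkItem item, WorkflowSession wfSession, MetaDataMap args)
      throws WorkflowException {
    String path = item.getWorkflowData().getPayload().toString();
    boolean dryRun = args.get("dryRun", false);
    ResourceResolver resolver = wfSession.adaptTo(ResourceResolver.class);
    Resource root = resolver.getResource(path);
    if (root == null) {
      throw new WorkflowException("Payload path does not resolve: " + path);
    }
    migrateSubtree(root, dryRun);
    try {
      if (!dryRun) {
        resolver.commit();
      }
    } catch (PersistenceException e) {
      throw new WorkflowException("Could not persist migration of " + path, e);
    }
  }

  private void migrateSubtree(Resource resource, boolean dryRun) {
    if (needsMigration(resource)) {           // silently skip everything else
      LOG.info("Migrating {} (dryRun={})", resource.getPath(), dryRun);
      if (!dryRun) {
        ModifiableValueMap props = resource.adaptTo(ModifiableValueMap.class);
        // ... actual property/structure changes on 'props' go here ...
      }
    }
    resource.getChildren().forEach(child -> migrateSubtree(child, dryRun));
  }

  private boolean needsMigration(Resource resource) {
    // placeholder: detect the old component version / property layout
    return false;
  }
}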

Now, let’s face the few things which don’t work.

Non-working approach 1: Changing content on the fly

I have seen page rendering code which tries to modify the content it is operating on: removing old properties, adding new properties with default or other values.

This approach can only work if the user has write permissions on the content. The migration happens the first time the rendering is triggered by a session with write permissions (normally a regular editor on the authoring system); in every other situation it will fail (e.g. on publish, if the same conditions exist there as well). And you end up with an ugly mix of page rendering and content-fixup code in your components.

This is a very optimistic approach over which you don’t have any control, and for that reason you can probably never remove the fixup code, because you never know whether all content has already been changed.

Non-working approach 2: Let’s do it on startup

Admittedly, I have seen this only once. But it was a weird thing: a migration OSGI service was created which executed the content migration in its activate() method. And we came across it because this activate() delayed the entire startup so much that our automation ran into a timeout, because we don’t expect the startup of an AEM instance to take 30+ minutes.

This is also its biggest problem and what makes it unusable: you don’t have any control over this process, it can be problematic in the case of clustered repositories (as in AEM CS authoring), and even if the migration has already been completed, the check whether there’s something to do can take quite long.

But hey, if you already have it implemented as a service, it’s quite easy to migrate it to a workflow and then use the approach recommended above.


Let me know if you have found other working or non-working approaches for content migration; in my experience it’s always best to make this an explicit task which can be planned, managed and properly executed. Everything else can work sometimes, but definitely with a less predictable outcome.

SQL injection in AEM?

TL;DR While SQL injection in AEM is less of a problem than in other web frameworks, it should not be ignored, because being able to read and extract content can pose a problem. For that reason, review your code and your permission setup.

When you follow the topic of software security, you are probably well aware of the problem of “SQL injection”, a situation in which an attacker can control (parts of) a SQL command which is sent to a database. In one form or another, SQL injection has been part of the OWASP Top 10 for a really long time. And even though almost all application frameworks have built-in ways to mitigate it, it’s still a problem.

From a high-level perspective, AEM-based applications can also be affected by SQL injection. But due to its design, the impact is less grave:

  • JCR SQL/XPath or the QueryBuilder are just query languages, and they don’t support any kind of operations which create or modify content.
  • These queries always operate on top of a JCR session, which implements resource based access control on top of principals (users and groups).

These two constraints limit the impact of any type of “SQL injection” (in which an attacker can control parts of a query), because as an attacker you can only retrieve content to which the principal you are impersonating has read access. For that reason a properly designed and implemented permission setup will prevent sensitive data which should not be accessible to that principal from being extracted; and modifications are not possible at all.

Nevertheless, SQL injection is possible. I frequently see code in which parameters passed with a request are not validated and checked, but instead passed unfiltered as a repository path into queries or other API calls. Of course this will cause exceptions or NPEs if that repository path is not accessible to the session associated with that request. But if that user session has read access to more data than it actually needs (or the code even uses a privileged session which has access to even more content), an attacker can still access and potentially extract content which the developers did not plan for.
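
To make this a bit more concrete, here is a minimal sketch (parameter names, the allowed subtree and the query itself are made up for illustration): the incoming path is validated against the subtree the code actually expects, and the free-text parameter goes into the query as a JCR-SQL2 bind variable instead of being concatenated into the statement.

import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;
import org.apache.sling.api.SlingHttpServletRequest;

public class SearchHelper {

  // only this subtree is a legitimate starting point for the query
  private static final String ALLOWED_ROOT = "/content/mysite/";

  public QueryResult findByTitle(SlingHttpServletRequest request) throws Exception {
    String title = request.getParameter("title");
    String root = request.getParameter("root");

    // reject anything that is not a plain path inside the expected subtree
    if (title == null || root == null || !root.startsWith(ALLOWED_ROOT)
        || root.contains("..") || root.contains("'")) {
      throw new IllegalArgumentException("Invalid request parameters");
    }

    Session session = request.getResourceResolver().adaptTo(Session.class);
    QueryManager qm = session.getWorkspace().getQueryManager();

    // root has been validated above; the user-supplied title is passed as a bind variable
    Query query = qm.createQuery(
        "SELECT * FROM [cq:Page] AS p"
            + " WHERE ISDESCENDANTNODE(p, '" + root + "')"
            + " AND p.[jcr:content/jcr:title] = $title",
        Query.JCR_SQL2);
    query.bindValue("title", session.getValueFactory().createValue(title));
    return query.execute();
  }
}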

So from a security point of view you should care about SQL injection also in AEM:

  • Review your code and make sure that it does proper input validation.
  • Review your permission setup. Especially check the permissions of the anonymous user, as this user is used for non-authenticated requests on publish.
  • Make sure that you use service users only for their intended purposes. Also be aware that the security gain of a service user is very limited if code invoked and parameterized by an anonymous request executes a query with that service user on restricted content only accessible to it. In that case you could just as well make that restricted content readable directly to the anonymous user; it would not be less secure.

AEM CS: Java 21 update

After a lengthy preparation period, the rollout of Java 21 for AEM as a Cloud Service will start this year. While the public documentation contains all relevant information (and I don’t want to reiterate it here), I want to clarify a few things.

First, this is an update of the Java version used to run AEM as a Cloud Service. This version can differ from the Java version which is used to build the application. As Java versions are backwards compatible and can read binaries created by older versions, it is entirely possible to run the AEM CS instance with Java 21 but still build the application with Java 11. Of course this restricts you to the language features of Java 11 (for example you cannot use records), but besides that there is no negative impact at all.

This scenario is fully supported; but at some point you need to update your build to a newer Java version, as freshly added APIs might use Java features which are not available in Java 11. And as a personal recommendation I would suggest switching your build-time Java version to Java 21 as well.

This change of the runtime Java version should in most cases be totally invisible to you; at least as long as you don’t use or add 3rd-party libraries which need to support new Java versions explicitly. The most prominent libraries in the AEM context are Groovy (often part of the Groovy console) and the ASM library (a library which allows you to create and modify Java bytecode). If you deploy one of these into your AEM instance, make sure that you update them to a version which supports Java 21.

JCR queries with large result sets

TL;DR: If you expect large result sets, try to run that query asynchronously and not in a request; and definitely pay attention to the memory footprint.

JCR queries can be a tricky thing in AEM, especially when it comes to their performance. Over the years practices have emerged, the most important of them being “always use an index”. You can find a comprehensive list of recommendations in the JCR Query cheat sheet for AEM.

There you can also find the recommendation to limit the size of the result set (it’s the last item in the list); while that can definitely help if you need just one or a handful of results, it does not help if you need to process all results of a query. And the situation gets even worse if you know that this result set can be large (thousands or even tens of thousands of results).

I have often seen this when content maintenance processes were executed in the context of requests: they took many minutes in an on-prem setup, but then failed on AEM CS because of its hard limit of 60 seconds for requests.

Large result sets come with their own complexities:

  1. Iterating through the entire result set requires ACL checks plus the proper conversion into JCR objects. That’s not free.
  2. As the query engine puts a (configurable) read limit on a query, a result set can contain at most 100k nodes by default. And this number is the best case, because any access to the repository to post-filter the results delivered by the Lucene index also counts towards that limit. If you cross it, reading the result set terminates with an exception.
  3. The memory consumption: while JCR queries provide an iterator to read the result set, the QueryBuilder API offers methods which read the entire result set and return it as a list (SearchResult.getHits()). If this API is used, the result set alone can consume a significant amount of heap.
  4. And finally: what does the application do with the result set? Does it perform an operation on each result individually and then discard it? Or does it read each result from the query, perform some calculations and store the outcome in a list/array for the next step of processing? Assuming that you have 100k QueryBuilder Hits and 100k custom objects (potentially even referencing the Hit objects), that can easily lead to a memory consumption in the gigabytes.
  5. And all that could happen in parallel.

In my experience all of these properties of large result sets mandate that you run such a query asynchronously, as it’s quite possible that it takes tens of seconds (or even minutes) to complete. Either run it as a Sling Job or use a custom Executor in the context of an OSGI service, but do not run it in the context of a request, as in AEM CS that request has a big chance of timing out.
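
As an illustration of the Sling Job variant, here is a rough sketch (topic name, sub-service name, query and per-result processing are placeholders): the query runs in a job consumer with a service resource resolver, iterates the result instead of materializing it as a list, and processes each hit immediately.

import java.util.Iterator;
import java.util.Map;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;
import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.consumer.JobConsumer;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Component(service = JobConsumer.class,
           property = {JobConsumer.PROPERTY_TOPICS + "=com/example/maintenance/largequery"})
public class LargeQueryJobConsumer implements JobConsumer {

  private static final Logger LOG = LoggerFactory.getLogger(LargeQueryJobConsumer.class);

  @Reference
  private ResourceResolverFactory resolverFactory;

  @Override
  public JobResult process(Job job) {
    Map<String, Object> authInfo =
        Map.of(ResourceResolverFactory.SUBSERVICE, "content-maintenance");
    try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(authInfo)) {
      // iterate the result instead of loading it as a list, and handle each hit immediately
      Iterator<Resource> hits = resolver.findResources(
          "SELECT * FROM [cq:Page] AS p WHERE ISDESCENDANTNODE(p, '/content/mysite')",
          "JCR-SQL2");
      long count = 0;
      while (hits.hasNext()) {
        Resource page = hits.next();
        // ... do the actual per-result work here, without keeping a reference to it ...
        count++;
      }
      LOG.info("Processed {} results", count);
      return JobResult.OK;
    } catch (Exception e) {
      LOG.error("Query job failed", e);
      return JobResult.FAILED;    // failed jobs are retried by the job manager
    }
  }
}

The job itself would then be queued by the calling code via JobManager.addJob() with the same topic and whatever parameters the query needs.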

This was 2024

Wow, another year has passed. Time for a recap.

My personal goal for 2024 for this blog was to post more often and more consistently, and I think that I was successful at that. If I counted correctly, it was 20 posts in 2024. The consistency of the intervals could be better (a few just days apart, others multiple weeks), but unlike in some other years I never really felt that I was lagging way behind. So I am quite happy with it and will try to do the same in 2025.

This year I adopted 2 ideas from other blogs:

  • A blog post series which is planned as such. In January and February I posted 5 posts on Modeling Performance Tests (starting here). This approach worked quite well, mostly because I spent enough time writing them before I made the first post public. If I know upfront that a topic is large enough, I will continue with this type.
  • The “top N things …” type of post. I don’t particularly like this type of posting, because very often they just scream for attention and clicks without adding much value. I used that approach twice (The new AEM CS feature in 2024 I love most and My top 3 reasons why page rendering is slow), and then mostly to share links to other pages. It can work that way, but it will never be my favorite type of blog post.

The most successful blog posts of 2024: as I did not add any page analytics to this site (I would need a cookie banner then), I only have some basic statistics from WordPress. The top 3 requested pages besides the start page in 2024 were:

  1. CQ development patterns – Sling ResourceResolver and JCR Sessions (written in 2013)
  2. Do not use AEM as proxy for backend calls (of 2024)
  3. How to analyze “Authentication support missing” (of 2023)

It’s interesting that a 10-year-old article was requested most often. WordPress also showed me that LinkedIn was a significant source of traffic, so I should probably continue to announce blog posts there. (If you think I should also do announcements elsewhere, let me know.)

And just today I saw the latest video from Tad Reeves, where he mentioned my article on performance testing in AEM CS. Thank you Tad, I really appreciate your feedback and the recognition!

That’s it for 2024! I wish you all a relaxing break and a successful year 2025!

My top 3 reasons why page rendering is slow

In the past years I was engaged in many performance tuning activities, mostly related to slow page rendering on AEM publish instances. Performance tuning on the authoring side is often different and definitely much harder :-/

Over time I identified 3 main types of issues which make page rendering slow. Slow page rendering can be hidden by caching, but at some point the page needs to be rendered, and it often makes a difference whether this process takes 800ms or 5 seconds. Okay, so let’s start.

Too many components

This is a pattern which I see often in older codebases. Pages are often assembled out of 100+ components, very often deeply nested. The personal record I have seen was 400 components, nested 10 levels deep. This typically also causes problems in the authoring UI, because you need to be very careful to select the correct component rather than its parent or a child container.

The problem for the page rendering process is the overhead of each component. This overhead consists of the actual include logic plus all the component-level filters. While a single inclusion and a single component does not take much time, the large number of components causes the problem.

For that reason: please, please reduce the number of components on your page. Not only the backend rendering time, but also the frontend performance (less JavaScript and fewer CSS rules to evaluate) and the authoring experience will benefit from it.

Slow Sling models

I love Sling Models. But they can also hide a lot of performance problems (see my series about optimizing Sling Models), and thus can be a root cause of performance problems. In the context of page rendering and Sling Models backing HTL scripts, the problem is normally not the annotations (see this post), but rather complex and time-consuming logic executed when the models are instantiated, most specifically the problem of executing the same logic multiple times (as described in my earlier post “Sling Model Performance (Part 4)“).

External network connections

In this pattern a synchronous call is made to a different system during page rendering; and while this request is executed, the rendering thread on the AEM side is blocked. This turns into a problem if the backend is slow or not available. Unfortunately this is the hardest case to fix, because removing it often requires a re-design of the application. Please also see my post “Do not use AEM as a proxy for backend calls” for this; it contains a few recommendations on how to avoid at least some of the worst aspects, for example by using proper timeouts.
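
On the timeout topic, a minimal sketch using the JDK’s built-in HTTP client (URL and timeout values are just placeholders; the same can be achieved with Apache HttpClient or any other client library):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BackendClient {

  // fail fast instead of blocking the rendering thread indefinitely
  private final HttpClient client = HttpClient.newBuilder()
      .connectTimeout(Duration.ofSeconds(2))
      .build();

  public String fetch() throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://backend.example.com/api/data"))
        .timeout(Duration.ofSeconds(3))   // overall limit for this single request
        .GET()
        .build();
    return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
  }
}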

Sling model performance (part 4)

I think it’s time for another chapter on the topic of Sling Model performance, just to document some interesting findings I recently made in the context of a customer project. If you haven’t read them yet, I recommend checking out the first 3 parts of this series first.

In this blog post I want to show the impact of inheritance in combination with Sling Models.

Sling Models are simple Java POJOs, and for that reason all features of Java can be used. I have seen many projects where these POJOs inherit from a more or less sophisticated class hierarchy, which often reflects the component hierarchy. These parent classes also often consolidate generic functionality used in many or all Sling Models.

For example, many Sling Models need to know the site-root page, because from there they build links and the navigation, read global properties, etc. For that reason I have seen code like this in many parent classes:

import com.day.cq.wcm.api.Page;

public class AbstractModel {

  protected Page siteRoot;

  protected void init() {
    // getSiteRoot() resolves the site-root page (implementation not shown)
    siteRoot = getSiteRoot();
    // and many more initializations
  }
}

And then this is used like this by a Sling Model called ComponentModel:

import javax.annotation.PostConstruct;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.models.annotations.Model;

@Model(adaptables = Resource.class)
public class ComponentModel extends AbstractModel {

  @PostConstruct
  public void init() {
    super.init();  // runs all the (potentially expensive) initializations of the parent class
  }
  ...
}

That’s all straightforward and fine. But only until 10 other Sling Models also inherit from AbstractModel, and all of them also invoke the getSiteRoot() method, which in all cases returns a page object representing the same node in the repository. Feels redundant, and it is. And it’s especially redundant if a model invokes the init() method of its parent but does not really need all of the values calculated there.

While in this case the overhead is probably small, I have seen cases where the removal of such redundant code brought down the rendering time from 15 seconds to less than 1 second! That’s significant!

For this reason I want to make some recommendations on how you can speed up your Sling Models when you use inheritance.

  • If you want or need to use inheritance, make sure that the parent class has a small and fast init method, and that it does not add too much overhead to each construction of a Sling Model.
  • I love Java lambdas in this case, because you can pass them around and only invoke them when you really need their value. That’s ideal for lazy evaluation (see the sketch after this list).
  • And if you need a calculated value more than once, store it for later reuse.
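
A minimal sketch of what lazy, cached evaluation in the parent class could look like, here with a plain lazy getter (the assumption that the site root sits directly below /content is just for illustration; the same idea works with a java.util.function.Supplier that is only invoked on demand):

import com.day.cq.wcm.api.Page;

// revised parent class: no eager work in an init() method anymore
public class AbstractModel {

  private Page siteRoot;   // cached after the first computation

  // computed only when a subclass actually asks for it, and only once per model instance
  protected Page getSiteRoot(Page currentPage) {
    if (siteRoot == null && currentPage != null) {
      // assumption for this sketch: the site root is the first level below /content
      siteRoot = currentPage.getAbsoluteParent(1);
    }
    return siteRoot;
  }
}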

Monitoring Java heap

Every now and then I get the question: “What do you think about alerting at 90% heap usage of AEM?”. The answer is always a longer one, so I am writing it down here for easier linking.

TL;DR: Don’t alert on the amount of used heap, but only on garbage collection.

Java is a language which relies on garbage collection (GC). Unlike in other programming languages, memory is managed by the runtime. The operator assigns a certain amount of RAM to the Java process, and that’s it. A large fraction of this RAM goes into the heap, and the Java Virtual Machine (JVM) manages this heap entirely on its own.

Now, like every good runtime, the JVM is lazy and does work only when it’s required. That means it will start the garbage collection only when the amount of free memory is low. This is probably over-simplified, but good enough for the purpose of this article.

That means that the heap usage metric shows the heap usage approaching 100%, and then it suddenly drops to a much lower value, because the garbage collection just released memory which is no longer required. Then the garbage collection pauses, processing goes on and consumes memory, until at some point the garbage collection starts again. This leads to the typical saw-tooth pattern of the JVM heap.

(Saw-tooth heap usage chart; source: Interesting Garbage Collection Patterns by Ram Lakshmanan)

For that reason it’s not helpful to use the heap usage as an alerting metric: it fluctuates too much, and by the time the alert reaches you, the garbage collection has often already brought the usage down again.

But of course there are other situations, where the saw-tooth pattern becomes less visible because the garbage collection can release less memory with each run, and that can indeed point to a problem. How can this be measured?

In this scenario the garbage collection runs more frequently; the less memory it releases, the more often it runs, until the entire application is effectively stopped and only the garbage collection is running. So what you can use is the percentage of time the garbage collector runs per time period. Anything below 5% is fine, and anything beyond 10% is a problem.
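
If you want to see how such a number can be derived, here is a small sketch based on the JVM’s standard GarbageCollectorMXBeans (the sampling interval and the thresholds are up to your monitoring setup); in practice you would rather read these values via JMX from your monitoring tooling than roll your own sampler:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeSampler {

  private long lastTotalGcMillis = 0;
  private long lastSampleMillis = System.currentTimeMillis();

  // returns the percentage of wall-clock time spent in GC since the last call
  public double sampleGcTimePercent() {
    long totalGcMillis = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      totalGcMillis += Math.max(0, gc.getCollectionTime());   // -1 means "not available"
    }
    long now = System.currentTimeMillis();
    long gcDelta = totalGcMillis - lastTotalGcMillis;
    long timeDelta = Math.max(1, now - lastSampleMillis);
    lastTotalGcMillis = totalGcMillis;
    lastSampleMillis = now;
    return 100.0 * gcDelta / timeDelta;
  }
}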

For that reason, rather measure the garbage collection activity, as it is a better indicator of whether your heap is too small.