Handling Campaign traffic in AEM

It must have been 2007 when I first saw URLs with a query string “utm_id=someHexCode” in the logs of the Communiqué system I ran at that time. I still remember that we had about 4’000 of them on any given day, which was not that much of a problem. But I didn’t know back then that we would still deal with the very same requests more than 15 years later, at an even higher rate and with more severe consequences.

What is special about these query strings? The most important thing for backend folks is that these query strings are a frontend topic. They are used to attribute requests to a certain source, which is important for the Analytics folks to track the effectiveness of their campaigns.
For example, the query string “utm_id=cm2026-35-1” could be the code for “email blast 1 of campaign 35 in 2026”. If a user clicks on that link in an email, the analytics code in the page will read this query string and report it to the analytics server. This allows them to track the conversion rate or efficiency of this particular email blast and compare it to the results of a Facebook ad or other sources.

So this special type of traffic has 2 aspects which are important for backend folks like me:

It typically happens in spikes: Right after these links are distributed (via ads, emails or whatever other channel), users will click them.
These query strings have a meaning only on the frontend side; on the backend these parameters are not used at all.

But as most caches don’t cache any response where the URL contains a query string, such requests bypass all CDN caches by default. That means that such requests very frequently end up on the AEM publish instance for rendering, while from a backend perspective all of the following requests will produce the same result:

/content.html (the response of this request could be cached)
/content.html?utm_id=campaign1 (non-cacheable)
/content.html?utm_id=campaign25 (non-cacheable)

And because these requests frequently happen in spikes, this often leads to situations in which such a campaign triggers an overload of the AEM Publish layer. Which is sad, because your expensive and successful marketing campaign is responsible for a server-side outage, and instead of a great experience you serve your customers a slow site and/or errors.

Unfortunately I see too many of those situations.

What is the AEM answer to it?

The general idea to handle this situation is to strip off these campaign parameters from the request, which turns them into requests for which the response can be served from a cache, where the usual caching and expiration rules are applied. Where and how this can be done depends on your setup.

On AEM CS the best way is to handle this directly on the CDN (using Traffic Rules to normalize requests); this is the best solution, because any campaign traffic is served directly from the CDN, and it’s not bothering the origin (that means the dispatcher and the publish instances). If you are on AEM CS you should use this approach.

An approach which can be implemented on any AEM setup is to do it on the dispatcher. With the /ignoreUrlParams setting you specify the parameters which should be stripped from the request. If there is no query string left, the request is considered cacheable, and it’s checked against the usual dispatcher rules.

But in every case you need to be able to identify the query strings which you know will be used in the context of your AEM application. If you know them, you also know that you can ignore everything else. Configure them into the traffic rules or the /ignoreUrlParams section.
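
To make the idea a bit more tangible, here is a minimal sketch in plain Java of what “stripping the parameters the backend does not use” means. It is neither the syntax of the CDN traffic rules nor the way the dispatcher implements /ignoreUrlParams; the class name and the parameter “page” are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustration only: strips query parameters the backend does not use. */
final class CampaignUrlNormalizer {

    // Assumption: the only parameter this hypothetical application evaluates server-side.
    private static final Set<String> BACKEND_PARAMS = Set.of("page");

    /** Returns the URL without any parameter that the backend ignores. */
    static String normalize(String url) {
        int queryStart = url.indexOf('?');
        if (queryStart < 0) {
            return url; // no query string, cacheable as-is
        }
        String path = url.substring(0, queryStart);
        List<String> kept = new ArrayList<>();
        for (String pair : url.substring(queryStart + 1).split("&")) {
            String name = pair.split("=", 2)[0];
            if (BACKEND_PARAMS.contains(name)) {
                kept.add(pair); // the backend needs this one, keep it
            }
        }
        return kept.isEmpty() ? path : path + "?" + String.join("&", kept);
    }
}
```

With this, /content.html?utm_id=campaign1 and /content.html?utm_id=campaign25 both map to /content.html, so a cache can answer them with the same response.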

Every AEM instance should have this configured in order to handle such traffic spikes.

AEM CS API deprecations

You might already have received alerts that your application is using deprecated APIs, and that you should act now. Many of these warnings have been out there for quite some time, but have never been enforced. It looks like the time has come now.

As I have already handled a few of these cases and warnings, here are a few words on them.

Use the aemanalyser-maven-plugin

To speed up the feedback cycle, use the aemanalyser-maven-plugin to perform the same validation the fullstack pipeline would do. Execute the following steps to make it work:

Add a dependency to the aemanalyser-maven-plugin to your all/pom.xml. Make sure that you are using the latest version 1.6.16.
Configure it to execute the project-analyse goal (also in the all/pom.xml); see here how it’s done in the AEM archetype.

Strictly speaking this is not necessary in the context of the API deprecation process, but having it will make your life much easier.

Update ACS AEM Commons

ACS AEM Commons is a library which is used by many AEM CS customers. Updating it to version 6.11.0 (or later) makes all warnings go away. Just make sure that you use the “cloud” classifier. And if you are still referencing or embedding the “twitter” module, remove it, as it has not been updated for a long time, and I doubt that it is still working.

Guava

If you get warnings about the use of the deprecated API packages “com.google.*”, they are caused by the upcoming removal of Guava from the AEM public API. While the official documentation recommends using the latest Guava version, this often does not work. Instead use version 15.0 as a short-term solution. (Guava changes its public API quite often, and for that reason newer versions are often not compatible.)

To make that work:

Add a dependency on Guava 15 to your all/pom.xml file.
Add an additional <embed> statement to the filevault-package-maven-plugin in the all/pom.xml; it should follow the same pattern you use to embed your own core bundle.

A more long-term solution is to remove the dependencies on Guava bit by bit; in many cases this removal should be easily possible, as the JRE has adopted much of the functionality Guava provides.
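
To give an impression of what such a removal can look like, here is a small sketch with JDK replacements for a few common Guava idioms. The class and method names are made up for this example; the comments name the Guava call each method stands in for. Whether a replacement is a drop-in fit depends on your code (for example, List.of rejects null elements), so treat it as a starting point and not as a complete mapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

/** Illustration only: JDK counterparts for some frequently used Guava helpers. */
final class GuavaFreeExamples {

    // Guava: ImmutableList.of("news", "press")
    static List<String> tags() {
        return List.of("news", "press");
    }

    // Guava: Joiner.on(",").join(values)
    static String joined(List<String> values) {
        return String.join(",", values);
    }

    // Guava: Strings.isNullOrEmpty(value)
    static boolean isNullOrEmpty(String value) {
        return value == null || value.isEmpty();
    }

    // Guava: Preconditions.checkNotNull(value, "value must not be null")
    static String required(String value) {
        return Objects.requireNonNull(value, "value must not be null");
    }

    // Guava: com.google.common.base.Optional.fromNullable(value).or(fallback)
    static String orFallback(String value, String fallback) {
        return Optional.ofNullable(value).orElse(fallback);
    }

    // Guava: Lists.newArrayList(values)
    static List<String> mutableCopy(List<String> values) {
        return new ArrayList<>(values);
    }
}
```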

Everything else

With the above 2 steps for ACS AEM Commons and Guava you should be able to address a large portion of the most urgent deprecations. Nevertheless there might be other dependencies which show up.

Custom code: this should be the easiest case, as you have full control over it. In many cases the problem is very localized and should be easy to address. For example, I don’t know why anyone would require a direct logback dependency; while there might be valid use cases for it, I think that in many cases this is caused by a simple package import in a Java class which is not used at all. Simply removing that package import could fix this problem, without any functional change required.
3rd party dependencies: Much harder to solve, as you cannot change them. As a first step I would check if there are any updates available which fix this behavior. If not, get in contact with the vendor/provider and seek guidance. And let Adobe know if the time window for a fix clashes with Adobe’s official API deprecation schedule.

In general I believe that much of this work should be straightforward.

Updating Maven dependencies

The topic of “dependencies” is top of mind for many. As I see a lot of these questions coming up, I want to share a few steps which make sense to follow when you need to update your dependencies.

Ensure that the AEM SDK version you are referencing is recent. This ensures that your code has access to the latest libraries AEM CS ships with.
Run mvn versions:display-dependency-updates to display all available updates to dependencies. Make sure to update all libraries which you added and which AEM does not provide out of the box.
And while you are at it, you can run mvn versions:display-plugin-updates as well to see which updated plugin versions are available.

And while you are at it: You can also update your Java build toolchain to use a more recent Java version; see the documentation for how to do it. Technically it’s not yet required, but even if there’s no ETA yet, the time will come when Java 8 and Java 11 will no longer be supported as build versions.

AEM & Java 21

It’s 2026 and with it we start the year of Java 21 in AEM.

In 2025 we saw the migration of (almost) all customers of the AEM CS platform from a Java 11 runtime to Java 21. And on February 9th the support for any older runtime version will be officially removed, both from the instances in the cloud and from the SDK builds. That means that for your local development you will need to use Java 21 as a runtime as well.

And while the build environments *can* still run with Java 11 (or even Java 8), there is always the chance that updated build-time dependencies might pull in the need to update the build-time java version as well. While in most cases such an update works flawlessly, there are a few test frameworks which need updating; for example you might need to update your Mockito version (if you are still using 1.x that will be a bit of work! Did that on a few codebases…) and such.

But honestly, using Java 21 is worth all of that, because it gives you 2 main benefits:

The improved performance of Java 21 in the build process itself; my personal experience is that it can reduce the duration of build-time processes (especially for the unit tests) to 50%, which is significant.
It unlocks the capabilities of the Java 21 language features; might be insignificant to many, but there are nice things included (records?)
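
In case you have not used records yet, here is a tiny sketch of what they give you. The type name and its fields are made up for this example; a record generates the constructor, the accessors, equals(), hashCode() and toString(), and it only compiles when the build runs on a Java version that supports records.

```java
/** Hypothetical value type; the name and the fields are invented for this example. */
record TeaserLink(String label, String target) {

    // Compact constructor: validate once, the fields stay final afterwards.
    TeaserLink {
        if (label == null || label.isBlank()) {
            throw new IllegalArgumentException("label must not be blank");
        }
    }

    /** Records can carry regular methods next to the generated accessors. */
    String display() {
        return label + " -> " + target;
    }
}
```

Two TeaserLink instances with the same label and target are equal, which removes quite a bit of boilerplate from simple data holders.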

The release notes also mention that some time in the future the support for a Java 8 and Java 11 build-time environment will be removed. So better be prepared for that and add the topic of “Updating build environment to Java 21” to your backlog for 2026.

Writing backwards compatible software

At last week’s adaptTo() conference I discussed the topic of AEM content migration with a few attendees, and why it’s a tough topic. I learned that the majority of these ad-hoc migrations are done because they are mandated by changes in the components themselves. And therefore migrations are required to adhere to the new expectations of the component. My remark “can’t you write your components in a way, that they are backwards compatible” was not that well received … it seems that this is a hard topic for many.

And yes, writing backwards compatible components is not easy, because it comes with a few prerequisites:

The awareness, that you are making a change, which breaks compatibility with existing software and content. While the breakages in “software” can be detected easily, the often-times very loose contract between code and content is much harder to enforce. With some experience in that area you will develop a feeling for that, but especially less experienced folks can make such changes inadvertently, and you will detect that problem way too late.
You need to have a strategy which tells you how to handle such a situation. While the AEM WCM Core Components introduced a versioning model, which seems to work quite nicely, an existing codebase might not be prepared for this. It forces some more structure and thought about how to design your codebase, especially when it comes to Sling Models and OSGi services, and where to put logic so you don’t duplicate it.
And even if you are prepared for this situation, it’s not for free, you will end up with new versions of components which you need to maintain. Just breaking compatibility is much easier, because you still will have just 1 component.

So I totally get if you don’t care about backwards compatibility at all, because you are in the end the only consumer of your code, and you can control everything. You are not a product developer, where backwards compatibility needs to have a much higher priority.

But backwards compatibility gives you one massive benefit, which I consider quite important: It gives you the flexibility to perform a migration at a time which is a good fit. It’s not that you need to perform this migration before, in the midst of or immediately after a deployment. You deploy the necessary code, and then migrate the content when it’s convenient. And if that migration date is pushed further for whatever reason, it’s not a problem at all, because this backwards compatibility allows you to decouple the technical aspect of it (the deployment) from the actual execution of the content migration. And for that you don’t need to re-scope deployments and schedules.
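
A very small sketch of what this can mean in code: the component logic accepts both the old and the new content structure, so the content can be migrated whenever it fits. A plain Map stands in for the properties of a resource, and both property names are hypothetical.

```java
import java.util.Map;

/** Sketch: reads a value that old content still stores under a legacy property name. */
final class TeaserTitleReader {

    // Both property names are hypothetical examples.
    static final String CURRENT_PROPERTY = "teaserTitle";
    static final String LEGACY_PROPERTY = "title";

    /** Prefers the current property, falls back to the legacy one, then to a default. */
    static String readTitle(Map<String, Object> properties) {
        Object current = properties.get(CURRENT_PROPERTY);
        if (current instanceof String) {
            return (String) current;
        }
        Object legacy = properties.get(LEGACY_PROPERTY);
        if (legacy instanceof String) {
            return (String) legacy; // content which has not been migrated yet
        }
        return "";
    }
}
```

The price is visible as well: the fallback has to stay in the code until you can prove that no content uses the legacy property anymore.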

So maybe this is just me wearing the hat of a product developer, who is so focused on backwards compatibility. And maybe in the wild the cost of backwards compatibility is much higher than the value of the flexibility it provides. I don’t know. Leave me a comment if you want to share your opinion.

Why I would deprecate InjectionStrategy.OPTIONAL for Sling Models

Sling Models offer a very convenient abstraction, as they allow data from the repository to be mapped into fields of Java POJO classes. One feature I often find used is the optional injection strategy. By default, if an injection is not working, the instantiation of the POJO fails. When the default injection strategy is set to OPTIONAL in the model annotation (defaultInjectionStrategy = DefaultInjectionStrategy.OPTIONAL, see the Sling docs), such a non-working injection will not fail the creation of the model; instead the field is left with the default value of the respective type, which is null for Strings and other complex types. And this setting is valid for the entire class, so when you want to write reliable code, you would have to assume that every injected String property could be null.

This comes with a few challenges, because now you can’t rely anymore on values being non-null, but you would need to test each field if a proper value has been provided. Which is sometimes done, but in the majority of cases it is just assumed, that the field is non-null.

I wonder why this is done at all. Because normally you write your components in a way that the necessary properties are always available. And if you operate with defaults, you can guarantee in several ways that they are available as soon as the component is created and authored for the very first time. And while for a few cases a missing property must be dealt with for whatever reason, it is never justified to treat all property injections as optional. Because that would mean that this Sling Model is supposed to make sense of almost any resource it is adapted from. And that won’t work.

And if a property is really optional: we added some time back the feature to use something like this (if you really can’t give a default value, which would be a much better choice):

@ValueMapValue
Optional<String> textToDisplay;

With this you can express the optionality of this value with the Java type system, and in that case it’s quite unlikely to miss the validation.
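
To show the effect on the consuming code, here is a plain-Java stand-in for such a model. It is not a real Sling Model: the constructor replaces the injection (an assumption, so the sketch runs outside of AEM), and the class name is made up.

```java
import java.util.Optional;

/** Plain-Java stand-in for a Sling Model; in AEM the field would be injected. */
final class TextModel {

    private final Optional<String> textToDisplay;

    // Assumption: the constructor replaces the injection so the sketch runs outside AEM.
    TextModel(String rawValue) {
        this.textToDisplay = Optional.ofNullable(rawValue);
    }

    /** The type forces the code to decide what happens when the property is absent. */
    String getText() {
        return textToDisplay.map(String::trim).filter(t -> !t.isEmpty()).orElse("");
    }

    boolean hasText() {
        return !getText().isEmpty();
    }
}
```

There is no way to call a String method on textToDisplay without going through the Optional first, and that is exactly the point.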

But if it were just up to me, I would deprecate InjectionStrategy.OPTIONAL and ban it, because it’s one of the most frequent reasons for NullPointerExceptions in AEM.

I know that using InjectionStrategy.OPTIONAL saves you from asking yourself “is this property always present?”, but that’s a very poor excuse. Because with just a few more seconds of work you can make your Sling Model more robust by just providing default values for every injected field. So please:

Avoid using optional injections when possible!
When it’s required use the Optional type to express it!
Don’t use InjectionStrategy.OPTIONAL!

Using “optional” (in all cases) can also come with a performance impact when used with the generic @Inject annotation; for that read my earlier blog posts on the performance of Sling Models: Sling Model Performance.

How not to do content migrations

(Note: This post is not about getting content from environment A to B or from your AEM 6.5 to AEM CS.)

The requirements towards content and component structure evolve over time; the components which you initially started with might not be sufficient anymore. For that reason the components will evolve: they need new properties, or components need to be added/removed/merged, and that must be reflected in the content as well. This is something which is possible to do manually, but which will take too much work and is too error-prone. Automation to the rescue.

I already came across a few of those “automated content migrations”, and I have found a few patterns which don’t work. But before I start with them, let me briefly cover the one pattern, which works very well.

The working approach

The only working approach is a workflow, which is invoked on small-ish subtrees of your content. It skips silently over content which does not need to be migrated, and reports every situation which got migrated. It might even have a dry-run mode, which just reports everything it would change. This approach has a few advantages:

It will be invoked intentionally on author only, and only operates on a single, well-defined subtree of content. It logs all changes it makes.
It does not automatically activate every change it has made, but requires activation as a dedicated second step. This allows you to validate the changes and activate them only then.
If it fails, it can repeatedly get invoked on the same content, and continue from where it left off.
It’s a workflow, with the guarantees of a workflow. It cannot time out as a request can do, but will complete eventually. You can either log the migration output or store it as dedicated content/node/binary data somewhere. You know when a subtree is migrated and you can prove that it’s completed.
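
The core of such a migration step can be sketched in a few lines. This is not a workflow process step and does not use any AEM API: a Map of paths to properties stands in for the repository, and the property names are hypothetical. It only illustrates the properties listed above: limited to one subtree, silent on content which needs no change, repeatable, and with a dry-run mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of an idempotent migration step; a Map stands in for the repository. */
final class SubtreeMigration {

    // Hypothetical property names used only for this illustration.
    static final String OLD_PROPERTY = "headline";
    static final String NEW_PROPERTY = "title";

    /**
     * Migrates all nodes at or below subtreeRoot and returns the affected paths.
     * With dryRun = true the content stays untouched and the report lists
     * what would be changed.
     */
    static List<String> migrate(Map<String, Map<String, Object>> content,
                                String subtreeRoot, boolean dryRun) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> node : content.entrySet()) {
            String path = node.getKey();
            Map<String, Object> props = node.getValue();
            boolean inSubtree = path.equals(subtreeRoot) || path.startsWith(subtreeRoot + "/");
            if (!inSubtree || !props.containsKey(OLD_PROPERTY)) {
                continue; // outside the subtree or nothing to migrate: skip silently
            }
            if (!dryRun) {
                props.put(NEW_PROPERTY, props.remove(OLD_PROPERTY));
            }
            report.add(path);
        }
        return report;
    }
}
```

Because already migrated nodes are skipped, a second run over the same subtree reports nothing, which is what makes it safe to re-run after a failure.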

Of course this is not something you can simply do; it requires some planning in the design, the coding and the execution of the content migration.

Now, let’s face the few things which don’t work.

Non-working approach 1: Changing content on the fly

I have seen page rendering code which tries to modify the content it is operating on, removing old properties and adding new properties, either with default values or with other values.

This approach can work, but only if the user has write permissions on the content. The migration happens the first time the rendering is initiated with write permissions (normally by a regular editor on the authoring system), and it will fail in every other situation (e.g. on publish, if the conditions for the migration exist there as well). And you will have a non-cool mix of page rendering and content-fixup code in your components.

This is a very optimistic approach, over which you don’t have any control, and for that reason you probably can never remove that fixup code, because you never know if all content has already been changed.

Non-working approach 2: Let’s do it on startup

Admittedly, I have seen this only once. But it was a weird thing, because a migration OSGi service was created, which executed the content migration in its activate() method. We came across it because this activate() delayed the entire startup so much that our automation ran into a timeout, because we don’t expect the startup of an AEM instance to take 30+ minutes.

Which is also its biggest problem and which makes it unusable: You don’t have any control over this process, it can be problematic in the case of clustered repositories (in AEM CS authoring) and even if the migration has already been completed, the check if there’s something to do can take quite long.

But hey, when you have it already implemented as service, it’s quite easy to migrate it to a workflow and then use the above recommended approach.

Let me know if you have found other cases of working or non-working approaches for content migration; but in my experience it’s always the best way to make this an explicit task, which can be planned, managed and properly executed. Everything else can work sometimes, but definitely with a less predictable outcome.

SQL injection in AEM?

TL;DR While SQL injection in AEM is less a problem than in other web frameworks, it should not be ignored. Because being able to read and extract content can pose a problem. For that reason review your code and your permission setup.

When you follow the topic of software security, you are probably well aware of the problem of “SQL injection”, a situation in which an attacker can control (parts of) a SQL command which is sent to a database. In one way or another, SQL injection has been part of the OWASP Top 10 issues for a really long time. And even if almost all application frameworks have built-in ways to mitigate it, it’s still a problem.

From a high-level perspective, AEM-based applications can also be affected by SQL injections. But due to its design, the impact is less grave:

JCR SQL/XPath or the QueryBuilder are just query languages, and they don’t support any kind of operations which create or modify content.
These queries always operate on top of a JCR session, which implements resource based access control on top of principals (users and groups).

These 2 constraints limit the impact of any type of “SQL injection” (in which an attacker can control parts of a query), because as an attacker you can only retrieve content which the principal you are impersonating has read access to. For that reason a properly designed and implemented permission setup will prevent the extraction of any sensitive data which should not be accessible to that principal; and modifications are not possible at all.

Nevertheless, SQL injection is possible. I frequently see code in which parameters passed with a request are not validated and checked, but are instead passed unfiltered as a repository path into queries or other API calls. Of course this will cause exceptions or NPEs if that repository path is not accessible to the session associated with that request. But if that user session has read access to more data than it actually needs (or the code even uses a privileged session which has even more access to content), an attacker can still access and potentially extract content which the developers have not planned for.

So from a security point of view you should care about SQL injection also in AEM:

Review your code and make sure that it does proper input validation.
Review your permission setup. Especially check for the permissions of the anonymous user, as this user is used for the non-authenticated requests on publish.
Make sure that you use service users only for their intended purposes. On the other hand, the security gain by service-users is very limited, if code invoked and parameterized by an anonymous request executes a query with a service-user on restricted content only accessible to this service-user. In that case you can make that restricted content readable directly to the anonymous user and it would not be less secure.
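
As an illustration for the first point, here is a minimal sketch of validating a request parameter before it is used as a repository path in a query. The allowed root and the character allow-list are assumptions for this example; your application might need different rules, and the check does not replace a proper permission setup.

```java
import java.util.Optional;
import java.util.regex.Pattern;

/** Sketch: validates a request parameter before it is used as a repository path. */
final class SearchRootValidator {

    // Assumption: the only subtree this hypothetical feature is allowed to search.
    private static final String ALLOWED_ROOT = "/content/mysite";

    // Conservative allow-list of characters; anything else is rejected, not escaped.
    private static final Pattern SAFE_PATH = Pattern.compile("(/[A-Za-z0-9_-]+)+");

    /** Returns the path if it is safe to use, otherwise an empty Optional. */
    static Optional<String> validate(String requestedPath) {
        if (requestedPath == null || !SAFE_PATH.matcher(requestedPath).matches()) {
            return Optional.empty();
        }
        boolean belowRoot = requestedPath.equals(ALLOWED_ROOT)
                || requestedPath.startsWith(ALLOWED_ROOT + "/");
        return belowRoot ? Optional.of(requestedPath) : Optional.empty();
    }
}
```

Only a non-empty result should ever make it into the query; everything else is best answered with an error response instead of trying to repair the input.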

AEM CS: Java 21 update

After a lengthy preparation period, this year the rollout of Java 21 will start for AEM as a Cloud Service. While the public documentation contains all relevant information (and I don’t want to reiterate it here), I want to make a few things clearer.

First, this is the update of the Java version used to run AEM as a Cloud Service. This version can be different from the Java version which is used to build the application. As Java versions are backwards compatible and can read binaries created by older versions, it is entirely possible to run the AEM CS instance with Java 21, but still build the application with Java 11. Of course this restricts you to the language features of Java 11 and for example you cannot use Records, but besides that there is no negative impact at all.

This scenario is fully supported; but at some point you need to update your build version to a newer Java version, as freshly added APIs might use Java features which are not available in Java 11. And as a personal recommendation I would suggest switching your build-time Java version to Java 21 as well.

This change of the runtime Java version should in most cases be totally invisible for you, at least as long as you don’t use or add 3rd-party libraries which need to support new Java versions explicitly; the most prominent libraries in the AEM context are Groovy (often as part of the Groovy console) and the ASM library (a library which allows you to create and modify Java bytecode). If you deploy one of these into your AEM instance, make sure that you update it to a version which supports Java 21.

JCR queries with large result sets

TL;DR: If you expect large result sets, try to run that query asynchronously and not in a request; and definitely pay attention to the memory footprint.

JCR queries can be a tricky thing in AEM, especially when it comes to their performance. Over the years practices have emerged, with the most important of them being “always use an index”. You can find a comprehensive list of recommendations in the JCR Query cheat sheet for AEM.

There you can also find the recommendation to limit the size of the result set (it’s the last in the list); while that can definitely help if you need just 1 or a handful of results, this recommendation is void if you need to compute all results of a query. And that situation can get even worse if you know that this result set can be large (like thousands or even tens of thousands of results).

I have seen that often, when content maintenance processes were executed in the context of requests, which took many minutes in an on-prem setup, but then failed on AEM CS because of the hard limit of 60 seconds for requests.

Large result sets come with their own complexities:

Iterating through the entire result set requires ACL checking plus the proper conversion into JCR objects. That’s not for free.
As the query engine puts a (configurable) read limit on a query, it can have a result set of at most 100k nodes by default. This number is the best case, because any access to the repository to post-filter the result delivered by the Lucene index also counts towards that number. If you cross that limit, reading the result set will terminate with an exception.
The memory consumption: While the JCR queries provide an iterator to read the result set, the QueryBuilder API provides methods which read the entire result set and return it as a list (SearchResult.getHits()). If this API is used, the result set alone can consume a significant amount of heap.
And finally: what does the application do with the result set? Does it perform an operation on each result individually and then no longer need that single result? Or does it read each result from the query, perform some calculations and store them again in a list/array for the next step of processing? Assuming that you have 100k QueryBuilder Hits and 100k custom objects (potentially even referencing the Hit objects), that can easily lead to a memory consumption in the gigabytes.
And all that could happen in parallel.

In my experience all of these properties of large result sets mandate that you run such a query asynchronously, as it’s quite possible that this query takes tens of seconds (even minutes) to complete. Either run it as a Sling Job or use a custom Executor in the context of an OSGi service, but do not run it in the context of a request, as in AEM CS this request has a good chance of timing out.
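
The following sketch shows the pattern with a plain Executor: the results are consumed one by one from an iterator and processed outside of the calling thread. It deliberately uses no JCR or Sling API; the class and method names are made up, and the Iterator stands in for whatever your query returns.

```java
import java.util.Iterator;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

/** Sketch: consumes a large result set lazily and outside the request thread. */
final class AsyncResultProcessor {

    /**
     * Processes each result as soon as it is read and keeps no reference to it,
     * so the memory usage does not grow with the size of the result set.
     * Returns a future with the number of processed results.
     */
    static <T> CompletableFuture<Long> processAsync(Iterator<T> results,
                                                    Consumer<T> action,
                                                    ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> {
            long processed = 0;
            while (results.hasNext()) {
                action.accept(results.next());
                processed++;
            }
            return processed;
        }, executor);
    }
}
```

In a real implementation the session which produced the result set must stay open until the processing has finished and should only be used by that one thread, so it is usually the easiest to open the session, run the query and close the session all inside the asynchronous task.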