Handling Campaign traffic in AEM


It must have been 2007 when I first saw URLs with a query string “utm_id=someHexCode” in the logs of the Communiqué system I ran at that time. I still remember that we had about 4’000 of them on any given day, which was not much of a problem. But I didn’t know back then that we would still deal with the very same requests more than 15 years later, at an even higher rate and with more severe consequences.

What is special about these query strings? The most important thing for backend folks is that these query strings are a frontend topic. They are used to attribute requests to a certain source, which is important for the Analytics folks to track the effectiveness of their campaigns.
For example, the query string “utm_id=cm2026-35-1” could be the code of “email blast 1 of campaign 35 in 2026”. If a user clicks on that link in an email, the analytics code in the page will read this query string and report it to the analytics server. This then allows you to track the conversion rate or efficiency of this particular email blast and compare it to the results of a Facebook ad or other sources.

So this special type of traffic has 2 aspects which are important for backend folks like me:

It typically happens in spikes: Right after these links are distributed (via ads, emails or whatever other way of distributing them), users will click them.
These query strings have a meaning only on the frontend side; on the backend these parameters are not used at all.

But as most caches don’t cache any response where the URL contains a query string, such requests bypass all CDN caches by default. That means that such requests end up very frequently on the AEM publish instance for rendering, while from a backend perspective all of the following requests will produce the same results:

/content.html (the response of this request could be cached)
/content.html?utm_id=campaign1 (non-cacheable)
/content.html?utm_id=campaign25 (non-cacheable)

And because these requests frequently happen in spikes, such a campaign can easily trigger an overload of the AEM publish layer. Which is sad, because your expensive and successful marketing campaign is responsible for a server-side outage, and instead of a great experience you serve your customers a slow site and/or errors.

Unfortunately I see too many of those situations.

What is the AEM answer to it?

The general idea to handle this situation is to strip these campaign parameters from the request. This turns them into requests for which the response can be served from a cache, where the usual caching and expiration rules are applied. Where and how this can be done depends on your setup.

On AEM CS the best way is to handle this directly on the CDN (using Traffic Rules to normalize requests). This is the best solution because any campaign traffic is served directly from the CDN and does not bother the origin (that means the dispatcher and the publish instances). If you are on AEM CS you should use this approach.

A concept which can be implemented on any AEM setup is to handle it on the dispatcher. With the /ignoreUrlParams setting you specify the parameters which should be ignored for the caching decision, as if they were stripped from the request. If there is no query string left, the request is considered cacheable, and it’s checked against the usual dispatcher rules.

But in every case you need to be able to identify the query strings which you know will be used in the context of your AEM application. If you know them, you also know that you can ignore everything else. Configure them into the traffic rules or the /ignoreUrlParams section.

Every AEM instance should have this configured in order to handle such traffic spikes.
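
To make the normalization idea a bit more tangible, here is a minimal sketch in plain Java. It is not how the CDN or the dispatcher implement it; it only illustrates the logic of “keep the parameters your application knows, ignore everything else”. The class name and the allow-list with “q” and “page” are assumptions I made up for this example.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

final class CampaignParamNormalizer {

    // Hypothetical allow-list: the only query parameters the backend actually evaluates.
    private static final Set<String> KNOWN_PARAMS = Set.of("q", "page");

    /** Removes every query parameter which is not on the allow-list. */
    static String normalize(String pathAndQuery) {
        int idx = pathAndQuery.indexOf('?');
        if (idx < 0) {
            return pathAndQuery; // no query string at all
        }
        String path = pathAndQuery.substring(0, idx);
        String query = pathAndQuery.substring(idx + 1);
        String kept = Arrays.stream(query.split("&"))
                .filter(param -> !param.isEmpty())
                // compare only the parameter name (the part before '=')
                .filter(param -> KNOWN_PARAMS.contains(param.split("=", 2)[0]))
                .collect(Collectors.joining("&"));
        return kept.isEmpty() ? path : path + "?" + kept;
    }

    /** In this sketch a request is cacheable if no query string is left after normalization. */
    static boolean isCacheable(String pathAndQuery) {
        return normalize(pathAndQuery).indexOf('?') < 0;
    }
}
```

With this, /content.html?utm_id=campaign1 and /content.html?utm_id=campaign25 both normalize to /content.html and can be answered by the same cached response, while a request with a known parameter keeps its query string.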

AEM CS API deprecations

You might have received alerts already that your application is using deprecated APIs, and that you should act now. Many of these warnings have been out there for quite some time, but have never been enforced. Looks like now is the time.

As I have handled a few of these cases and warnings already, here are a few words about them.

Use the aemanalyser-maven-plugin

To speed up the feedback cycle, use the aemanalyser-maven-plugin to perform the same validation the fullstack pipeline would do. Execute the following steps to make it work:

Add a dependency to the aemanalyser-maven-plugin to your all/pom.xml. Make sure that you are using the latest version 1.6.16.
Configure it to execute the project-analyse goal (also in the all/pom.xml); see here how it’s done in the AEM archetype

Strictly speaking this is not necessary in the context of the API deprecation process, but having it will make your life much easier.

Update ACS AEM Commons

ACS AEM Commons is a library which is used by many AEM CS customers. Updating it to version 6.11.0 (or later) makes all warnings go away. Just make sure that you use the “cloud” classifier. And if you are still referencing or embedding the “twitter” module, remove it, as it has not been updated for a long time, and I doubt that it is still working.

Guava

If you get warnings about the use of the deprecated API packages “com.google.*”, these are caused by the upcoming removal of Guava from the AEM public API. While the official documentation recommends using the latest Guava version, this often does not work. Instead use version 15.0 as a short-term solution. (Guava changes its public API quite often, and for that reason newer versions are often not compatible.)

To make that work:

add a dependency to Guava 15 into your all/pom.xml file.
add an additional <embed> statement to the filevault-package-maven-plugin in the all/pom.xml; it should resemble the same pattern, as you embed your own core bundle.

A more long-term solution is to remove the dependencies on Guava bit by bit; in many cases this removal should be easily possible, as the JRE has adopted much of the functionality Guava provides.
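
To give an idea what such a removal can look like, here is a small sketch with a few JDK replacements for commonly used Guava calls. The class and method names are made up for this example, and I assume a build with Java 11 or later (List.copyOf needs at least Java 10). The Guava call mentioned in each comment is only the typical counterpart; check the exact semantics (for example around null handling) before you swap them in your own code.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.stream.Collectors;

final class WithoutGuava {

    // Guava counterpart: Preconditions.checkNotNull(path, "path")
    static String requirePath(String path) {
        return Objects.requireNonNull(path, "path");
    }

    // Guava counterpart: Joiner.on(",").skipNulls().join(tags)
    static String joinTags(List<String> tags) {
        return tags.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.joining(","));
    }

    // Guava counterpart: Strings.isNullOrEmpty(value)
    static boolean isNullOrEmpty(String value) {
        return value == null || value.isEmpty();
    }

    // Guava counterpart: com.google.common.base.Optional.fromNullable(title).or(fallback)
    static String titleOrDefault(String title, String fallback) {
        return Optional.ofNullable(title).orElse(fallback);
    }

    // Guava counterpart: ImmutableList.copyOf(items); both reject null elements
    static List<String> immutableCopy(List<String> items) {
        return List.copyOf(items);
    }
}
```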

Everything else

With the above 2 steps for ACS AEM Commons and Guava you should be able to address a large portion of the most urgent deprecations. Nevertheless there might be other dependencies which show up.

Custom code: This should be the easiest case, as you have full control over it. In many cases the problem is very localized and should be easy to address. For example I don’t know why anyone would require a direct logback dependency; while there might be valid use cases for it, I think that in many cases this is caused by a simple package import in a Java class, which is not used at all. Simply removing that package import could fix this problem, without any functional change required.
3rd party dependencies: Much harder to solve, as you cannot change them. As a first step I would check if there are any updates available which fix this behavior. If not, get in contact with the vendor/provider and seek guidance. And let Adobe know if the time window for a fix is clashing with the official API deprecation schedule by Adobe.

In general I believe that much of this work should be straightforward.

Updating Maven dependencies

The topic of “dependencies” is top of mind for many. As I see a lot of these questions coming up, I want to share a few steps which make sense to follow when you need to update your dependencies.

Ensure that your AEM SDK version you are referencing is recent. It will ensure that your code has access to the latest libraries AEM CS ships with.
Run mvn versions:display-dependency-updates to display all available updates to dependencies. Make sure to update all libraries which you added and which AEM does not provide out of the box.
And while you are at it, you can run mvn versions:display-plugin-updates as well to see what updated plugin versions are available.

And while you are at it: You can also update your Java build toolchain to use a more recent Java version; see the documentation for how to do it. Technically it’s not yet required, but even if there’s no ETA yet, the time will come when Java 8 and Java 11 will no longer be supported as build versions.

AEM & Java 21

It’s 2026 and with it we start the year of Java 21 in AEM.

In 2025 we saw the migration of (almost) all customers of the AEM CS platform from a Java 11 runtime to Java 21. And on February 9th the support for any older runtime version will be officially removed, both from the instances in the cloud and from the SDK builds. That means that for your local development you will need to use Java 21 as a runtime as well.

And while the build environments *can* still run with Java 11 (or even Java 8), there is always the chance that updated build-time dependencies might pull in the need to update the build-time java version as well. While in most cases such an update works flawlessly, there are a few test frameworks which need updating; for example you might need to update your Mockito version (if you are still using 1.x that will be a bit of work! Did that on a few codebases…) and such.

But honestly, that’s all worth the benefits of using Java 21. Because it gives 2 main benefits:

The improved performance of Java 21 in the build process itself; my personal experience is that it can reduce the duration of build-time processes (especially for the unit tests) to about 50%, which is significant.
It unlocks the capabilities of the Java 21 language features; might be insignificant to many, but there are nice things included (records?)

The release notes also mention that some time in the future the support for a Java 8 and Java 11 build-time environment will be removed. So better be prepared for that and add the topic of “Updating build environment to Java 21” to your backlog for 2026.

How not to do content migrations

(Note: This post is not about getting content from environment A to B or from your AEM 6.5 to AEM CS.)

The requirements towards content and component structure evolve over time; the components which you started with initially might not be sufficient anymore. For that reason the components will evolve: they need new properties, or components need to be added/removed/merged, and that must be reflected in the content as well. This is possible to do manually, but it will take too much work and is too error-prone. Automation to the rescue.

I already came across a few of those “automated content migrations”, and I have found a few patterns which don’t work. But before I start with them, let me briefly cover the one pattern, which works very well.

The working approach

The only working approach is a workflow, which is invoked on small-ish subtrees of your content. It skips silently over content which does not need to be migrated, and reports every situation which got migrated. It might even have a dry-run mode, which just reports everything it would change. This approach has a few advantages:

It will be invoked intentionally on author only, and only operates on a single, well-defined subtree of content. It logs all changes it makes.
It does not automatically activate every change it has made, but requires activation as a dedicated second step. This allows you to validate the changes and activate them only then.
If it fails, it can be invoked repeatedly on the same content, and continues from where it left off.
It’s a workflow, with the guarantees of a workflow. It cannot time out as a request can do, but will complete eventually. You can either log the migration output or store it as dedicated content/node/binary data somewhere. You know when a subtree is migrated and you can prove that it’s completed.

Of course this is not something you can simply do; it requires some planning in the design, the coding and the execution of the content migration.
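
The core of such a migration step can be sketched like this. To keep the example self-contained I use plain Java maps instead of the JCR or Sling API, and the workflow wiring is left out completely; the class name and the property names “headline” and “title” are assumptions for illustration only. What the sketch should show is the behavior: skip silently, report every change, support a dry run, and be safe to run again on the same subtree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TitleMigration {

    // Hypothetical property names, only for this example.
    static final String OLD_PROP = "headline";
    static final String NEW_PROP = "title";

    /**
     * Migrates all nodes of a subtree (path -> properties).
     * Returns a report with one line per node which was (or would be) changed.
     */
    static List<String> migrate(Map<String, Map<String, Object>> subtree, boolean dryRun) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> node : subtree.entrySet()) {
            Map<String, Object> props = node.getValue();
            if (!props.containsKey(OLD_PROP)) {
                continue; // nothing to do (or already migrated): skip silently
            }
            report.add((dryRun ? "would migrate " : "migrated ") + node.getKey());
            if (!dryRun) {
                // never overwrite a value which is already present in the new property
                props.putIfAbsent(NEW_PROP, props.get(OLD_PROP));
                props.remove(OLD_PROP);
            }
        }
        return report;
    }
}
```

Because a migrated node no longer has the old property, a second run over the same subtree returns an empty report. That is the property which lets you re-run the migration after a failure and also prove that a subtree is done.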

Now, let’s face the few things which don’t work.

Non-working approach 1: Changing content on the fly

I have seen page rendering code which tries to modify the content it is operating on: removing old properties and adding new properties, either with default values or with other values.

This approach can work, but only if the user has write permissions on the content. As this migration happens the first time the rendering is initiated with write permissions (normally by a regular editor on the authoring system), it will fail in every other situation (e.g. on publish, if the same conditions exist there as well). And you will have a not-so-cool mix of page rendering and content-fixup code in your components.

This is a very optimistic approach, over which you don’t have any control, and for that reason you probably can never remove that fixup code, because you never know if all content has already been changed.

Non-working approach 2: Let’s do it on startup

Admittedly, I have seen this only once. But it was a weird thing, because a migration OSGi service was created which executed the content migration in its activate() method. We came across it because this activate() delayed the entire startup so much that our automation ran into a timeout, because we don’t expect a startup of an AEM instance to take 30+ minutes.

Which is also its biggest problem and which makes it unusable: You don’t have any control over this process, it can be problematic in the case of clustered repositories (in AEM CS authoring) and even if the migration has already been completed, the check if there’s something to do can take quite long.

But hey, when you have it already implemented as service, it’s quite easy to migrate it to a workflow and then use the above recommended approach.

Let me know if you have found other cases of working or non-working approaches for content migration; but in my experience it’s always the best way to make this an explicit task, which can be planned, managed and properly executed. Everything else can work sometimes, but definitely with a less predictable outcome.

JCR queries with large result sets

TL;DR: If you expect large result sets, try to run that query asynchronously and not in a request; and definitely pay attention to the memory footprint.

JCR queries can be a tricky thing in AEM, especially when it comes to their performance. Over the years practices have emerged, with the most important of them being “always use an index”. You can find a comprehensive list of recommendations in the JCR Query cheat sheet for AEM.

There you can also find the recommendation to limit the size of the result set (it’s the last in the list); while that can definitely help if you need just 1 or a handful of results, this recommendation is void if you need to compute all results of a query. And that situation can get even worse if you know that this result set can be large (like thousands or even tens of thousands of results).

I have seen that often, when content maintenance processes were executed in the context of requests, which took many minutes in an on-prem setup, but then failed on AEM CS because of the hard limit of 60 seconds for requests.

Large result sets come with their own complexities:

Iterating through the entire result set requires ACL checking plus the proper conversion into JCR objects. That’s not for free.
As the query engine puts a (configurable) read limit to a query, it can have a result set of at maximum 100k nodes by default. This number is the best case, because any access to the repository to post-filter the result delivered by the Lucene index also counts towards that number. If you cross that limit, reading the result set will terminate with an exception.
The memory consumption: While the JCR queries provide an iterator to read the result set, the QueryBuilder API provides methods which read the entire result set and return it as a list (SearchResult.getHits()). If this API is used, the result set alone can consume a significant amount of heap.
And finally: What does the application do with the result set? Does it perform an operation on each result individually and then no longer need that single result? Or does it read each result from the query, perform some calculations and store them again in a list/array for the next step of processing? Assuming that you have 100k QueryBuilder Hits and 100k custom objects (potentially even referencing the Hit objects), that can easily lead to a memory consumption in the gigabytes.
And all that could happen in parallel.

In my experience all of these properties of large result sets mandate that you run such a query asynchronously, as it’s quite possible that this query takes tens of seconds (even minutes) to complete. Either run it as a Sling Job or use a custom Executor in the context of an OSGi service, but do not run it in the context of a request, as in AEM CS this request has a good chance of timing out.
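
Here is a minimal sketch of the “custom Executor” variant, with the focus on the memory footprint. It is plain Java and uses a generic Iterator as a stand-in for the result iterator of the query; the class and method names are made up for this example. Each result is handed to a callback and then forgotten, so nothing is collected into a list.

```java
import java.util.Iterator;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ResultSetProcessor {

    /**
     * Processes the results one by one and keeps no reference to them,
     * so only a single result is held by this code at any time.
     * Returns the number of processed results.
     */
    static <T> long processStreaming(Iterator<T> results, Consumer<? super T> action) {
        long count = 0;
        while (results.hasNext()) {
            action.accept(results.next());
            count++;
        }
        return count;
    }

    /** Runs the processing on the given executor instead of the request thread. */
    static <T> Future<Long> processAsync(ExecutorService executor,
                                         Iterator<T> results,
                                         Consumer<? super T> action) {
        Callable<Long> task = () -> processStreaming(results, action);
        return executor.submit(task);
    }
}
```

In a real AEM setup the iterator would come from the query result, and whatever session or resource resolver backs it must be opened and closed inside the background task and not borrowed from a request. That part is deliberately left out here.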

Monitoring Java heap

Every now and then I get the question: “What do you think if we alert at 90% heap usage of AEM?”. The answer is always longer, so I write it down here for easier linking.

TL;DR: Don’t alert on the amount of used heap, but only on garbage collection.

Java is a language which relies on garbage collection (GC). Unlike in other programming languages, memory is managed by the runtime. The operator assigns a certain amount of RAM to the Java process for usage, and that’s it. A large fraction of this RAM goes into the heap, and the Java Virtual Machine (JVM) manages this heap entirely on its own.

Now, like every good runtime, the JVM is lazy and does work only when it’s required. That means it will start the garbage collection only when the amount of free memory is low. This is probably over-simplified, but good enough for the purpose of this article.

That means that the heap usage metrics show that the heap usage is approaching 100%, and then it suddenly drops to a much lower value, because the garbage collection process just released memory which is no longer required. And then the garbage collection pauses and the processing goes on, consuming memory, until at some point the garbage collection starts again. This leads to the typical saw-tooth pattern of the JVM.

(source: Interesting Garbage Collection Patterns by Ram Lakshamanan)

For that reason it’s not helpful to use the heap usage as alerting metric, as it fluctuates too much, and it will alert you when the actual memory usage is already down.

But of course there are other situations, where the saw-tooth pattern gets less visible, as the garbage collection can release less memory with each run, and that can indeed point to a problem. How can this get measured?

In this scenario the garbage collection runs more frequently, and the less the garbage collection releases, the more often it runs, until the entire application is effectively stopped and only the garbage collection is running. That means that here you can use the amount of the time the garbage collector runs per time period. Anything below 5% is good, and anything beyond 10% is a problem.

For that reason, rather measure the garbage collection, as it is a better indicator if your heap is too small.
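
If you want to compute that number yourself, the JVM exposes the accumulated collection time via the standard GarbageCollectorMXBean. The following is only a sketch: the class name and the three status values are my own invention, the thresholds are the 5% and 10% from above, and what exactly “collection time” includes differs between garbage collectors, so treat the result as an approximation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcTimeMonitor {

    enum Status { GOOD, WATCH, PROBLEM }

    /** Accumulated GC time in milliseconds since JVM start, summed over all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    /** Share of an interval spent in GC, based on two samples of totalGcTimeMillis(). */
    static double gcPercent(long gcMillisAtStart, long gcMillisAtEnd, long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("intervalMillis must be positive");
        }
        return 100.0 * (gcMillisAtEnd - gcMillisAtStart) / intervalMillis;
    }

    /** Below 5% is good, beyond 10% is a problem, everything in between deserves a look. */
    static Status classify(double gcPercent) {
        if (gcPercent < 5.0) {
            return Status.GOOD;
        }
        if (gcPercent <= 10.0) {
            return Status.WATCH;
        }
        return Status.PROBLEM;
    }
}
```

Sample totalGcTimeMillis() at a fixed interval (for example once per minute), feed two consecutive samples into gcPercent(), and alert on the classification instead of on the heap usage.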

Delivering dynamic renditions

One of the early features of ACS AEM Commons was the Named Image Transformer as part of the release 1.5 of 2014. This feature allowed you to transform image assets dynamically with a number of options, most notably the transformation into different image dimensions to match the requirements of the frontend guidelines. This feature was quite popular, and in a stripped-down scope (it does not support all features) it also made it into the WCM Core Components (called the AdaptiveImageServlet).

This feature is nice, but it suffers from a huge problem: This transformation is done dynamically on request, and depending on the image asset itself it can consume a huge amount of heap memory. The situation gets worse when many of such requests are done in parallel, and I have seen more than once situations of AEM publish instances ending up in heavy garbage collection situations, ultimately leading to crashes and/or service outages.

This problem is not really new, as pretty much the same issue also happens at asset ingestion time, when the predefined renditions are created. While on AEM 6.5 the standard solution was to externalize this problem for asset ingestion (hello ImageMagick!), AEM CS solved this challenge in a different and more scalable way using AssetCompute. But neither solution addressed the problem of enduser requests to these dynamic renditions; this is and was still done on request in the heap.

We have implemented a number of improvements in the AdaptiveImageServlet to improve the situation:

A limit for requested dimensions was added to keep the memory consumption “reasonable”.
The original rendition is not necessarily used as a basis to render the image in the requested dimension, but rather the closest rendition which can satisfy the requirements of the requested parameters.
An already existing rendition is delivered directly if it matches the requested dimensions and image format.
An upcoming improvement for the AdaptiveImageServlet on AEM CS is to deliver these renditions directly from the blobstore instead of streaming the binary via the JVM.

This improves the situation already, but there are still customers and cases where images are resized dynamically. For these users I suggest making these changes:

Compile a list of all required image dimensions which you need in your frontend.
And then define matching processing profiles, so that whenever such a rendition is requested via the AdaptiveImageServlet it can be served directly from an existing rendition.

That works without changes in your codebase and will improve the delivery of such assets.
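
The idea of “use the closest rendition which satisfies the request” can be sketched in a few lines. This is not the code of the AdaptiveImageServlet; it is a simplified illustration which only looks at the width, and the class and rendition names are made up.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class RenditionPicker {

    /** Minimal stand-in for an existing rendition: a name and its width in pixels. */
    static final class Rendition {
        final String name;
        final int width;

        Rendition(String name, int width) {
            this.name = name;
            this.width = width;
        }
    }

    /**
     * Returns the smallest rendition which is at least as wide as requested.
     * An exact match needs no resizing at all; an empty result means that
     * only the original could serve as a basis.
     */
    static Optional<Rendition> closestAtLeast(List<Rendition> renditions, int requestedWidth) {
        return renditions.stream()
                .filter(r -> r.width >= requestedWidth)
                .min(Comparator.comparingInt(r -> r.width));
    }
}
```

If the processing profiles cover every width your frontend asks for, each request hits the exact-match case and no image has to be resized in the heap.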

And for the users of the Named Image Transformer of ACS AEM Commons I suggest to rethink the usage of it. Do you really use all of its features?

Restoring deleted content

I just wrote about backup and restore in AEM CS, and why backups cannot serve as a replacement for an archival solution. Instead, the backup is designed only as a precaution against major data loss and corruption.

But there is another aspect to that question: what about deleted content? Is requesting a restore the proper way to handle these cases?

Assume that you have accidentally deleted an entire subtree of pages in your AEM instance. From a functional point of view you can perform a restore to a time before this deletion of content. But that means that the entire content is rolled back: not only is the deleted content restored, but all other changes which were performed since that time are undone as well.

And depending on the frequency of activities and the time you would need to restore this can be a lot. And you would need to perform all these changes again to catch-up.

The easiest way to handle such cases is to use the versioning features of AEM. Many activities trigger the creation of a version of a page, for example when you activate it, when you delete it via the UI; you can also manually trigger the creation of a version. To restore one page or even an entire subtree you can use the “Restore” and “Restore Tree” features of AEM (see the documentation).

In earlier versions of AEM, versions were not created for assets by default, but this has changed in AEM CS; now versions are created for assets by default, pretty much as they are created for pages. That means you can use the same approach and restore versions of assets via the timeline (see the documentation).

With the proper versioning in place, most if not all of such accidental deletions or changes can be handled. This is the preferred approach, because it can be executed by regular users and does not impact the rest of the system, as rolling back all changes would. And you don’t have any downtime on authoring instances.

For that reason I recommend you to work as much as possible with these features. But there are situations, where the impact is that severe that you rather want to roll back everything than restoring things through the UI. In that situation a restore is probably the better solution.

AEM CS Backup, Restores and Archival

One recurring question I see in the Adobe internal communication channels is like this: “For our customer X we need to know how long Adobe stores backups for our CS instances”.

The obvious answer to this is “7 days” (see the documentation) or “3 months” (for Offsite backup), because the backup is designed only to handle cases of data corruption of the repository. But in most cases there is a followup question “But we need access to backup data up to 5 years”. Then it’s clear that this question is not about backup, but rather about content archival and compliance. And that’s a totally different question.

TL;DR

When you need to retain content for compliance reasons, my colleagues are happy to discuss the details with you. But increasing the retention period for your backups is not a solution for it.

Compliance

So what does “content archival and compliance” mean in this situation? For regulatory and legal reasons some industries are required to retain all public statements (including websites) for some time (normally 5-10 years). And of course the implementation of that is up to the company itself. And it seems quite easy to implement an approach which keeps the backups around for up to 10 years.

Some years back I spent some time on the drawing board to design a solution for an AEM on-prem customer; their requirement was to be able to prove what at any time within these 10 years was displayed to customers on their website.
We initially also thought about keeping backups around for 10 years; but then we came up with these questions:

When the content is required, a restore from that backup would be required to an environment which can host this AEM instance. Is such an environment (servers, virtual machines) available? How many of these environments would be required, assuming that this instance would be required to run for some months (throughout the entire legal process which requires content from that time)?
Assuming that an 8-year-old backup must be restored, are the old virtual machine images with Red Hat Linux 7 (or whatever OS) still around? Is it okay from a compliance perspective to run these old and potentially unsupported OS versions even in a secured network environment? Is the documentation still around which describes how to install all of that? Does your backup system still support a restore to such an old OS version?
How would you authenticate against such an old AEM version? Would you require your users to have their old passwords at hand (if you authenticate against AEM), or does your central identity management still support the interface this old AEM version is trying for authentication?
As this is a web page, is it ensured that all external references, which are embedded into the page are also available? Think about the Javascript and CSS libraries, which are often just pulled from their respective CDN servers.
How frequently must a backup be stored? Is it okay and possible to store just the authoring instance every quarter, not perform any cleanup (version cleanup, workflow purge, …) in that time and have all content changes versioned, so you can use the restore functionality to go back to the requested time? Or do you need to store a backup after each deployment, because each deployment has the chance to change the UI and introduce backwards-incompatible changes, which cause the restored content to no longer work? And would you need to archive the publish instance as well (where normally no versions are preserved)? And are you sure that you can trust the AEM version storage enough, so you can rely on JCR versioning to recreate any intermediary states between those retained backups?
When you design such a complex process, you should definitely test the restore process regularly.
And finally: What are the costs of such a backup approach? Can you use the normal backup storage, or do you need a special solution which guarantees that the stored data cannot be tampered with?

You can see that the list of questions is long. I don’t say it is impossible, but it requires a lot of work and attention to detail.

In my project the deal breaker was the calculated storage cost (we would have required a dedicated storage, as the normal backup storage did not provide the required guarantees for archival purposes). So we decided to take a different approach, and we added a custom process which creates a PDF/A out of every activated page and stores it in the dedicated archival solution (assets are stored as is). This adds upfront costs (a custom implementation), but is much cheaper in the long run. And on top of that, IT is not needed to access the old version of the homepage of January 23, 2019; instead the business users or legal can directly access the archive and fetch the respective PDF of the time they are interested in.

In AEM CS the situation is a bit different, because the majority of the questions above deal with “old AEM vs everything else around is current”, and many aspects are not relevant for customers anymore; they are in the domain of Adobe instead. But I am not aware that Adobe ever planned to set up such a time machine, which allows you to re-create everything at a specific point in time (besides all implications of security etc), mostly because “everything” is a lot.

So, as a conclusion: Using backups for content archival and compliance is not the best solution. It sounds easy at first, but it raises a lot of questions if you look into the details. The longer you need to retain these AEM backups, the more likely it is that inevitable changes in the surrounding environments make proper functioning harder or even impossible.