Some years ago, even before AEM as a Cloud Service, the RepoInit language was implemented as part of Sling (and AEM) to create repository structures directly at startup of the JCR repository. With it your application can rely on certain well-defined structures always being available.
In this blog post I want to walk you through a way to test repoinit statements locally and avoid pipeline failures caused by them.
Repoinit statements are deployed as part of OSGi configurations, which means that during the development phase you can work with them in an almost interactive way. Exceptions are not a problem either; you can fix the statement and retry.
The situation is much different when you already have repoinit statements deployed and you start up your AEM (to be exact: the Sling repository service) again. In this case all repoinit statements are executed as part of the startup of the repository, and any exception during the execution of repoinit will stop the startup of the repository service and render your AEM unusable. In the case of CloudManager and AEM as a Cloud Service this will break your deployment.
Let me walk you through 2 examples of such exceptions and how you can deal with them.
*ERROR* [Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepository service registration aborted java.lang.RuntimeException: Session.save failed: javax.jcr.nodetype.ConstraintViolationException: OakConstraint0025: /conf/site/configuration/favicon.ico[[nt:file]]: Mandatory child node jcr:content not found in a new node
at org.apache.sling.jcr.repoinit.impl.AclVisitor.visitCreatePath(AclVisitor.java:167) [org.apache.sling.jcr.repoinit:1.1.36]
at org.apache.sling.repoinit.parser.operations.CreatePath.accept(CreatePath.java:71)
In this case the exception is quite detailed about what actually went wrong. It failed on save, and it says that /conf/site/configuration/favicon.ico (of type nt:file) was affected. The problem is that the mandatory child node "jcr:content" is missing.
Why is that a problem? Because every node of nodetype "nt:file" requires a "jcr:content" child node which actually holds the binary.
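The failing statement itself is not quoted in the log, but given the stacktrace (visitCreatePath) it presumably looked something like this reconstruction:

create path (nt:file) /conf/site/configuration/favicon.ico

Repoinit's "create path" can create the nt:file node, but not the mandatory jcr:content child with its binary, so the subsequent save fails. A safer approach is to let repoinit create only folder structures and ship actual files via a content package:

create path (sling:Folder) /conf/site/configuration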
This is a case which you can detect very easily on a local environment as well.
Which leads to the first recommendation:
When you develop in your local environment, you should apply all repoinit statements to a fresh environment which contains no manual changes. Otherwise your repoinit statements may rely on structures which were created manually and are not provided by the repoinit scripts.
Having a mix of manual changes and repoinit on a local development environment and then moving it over untested often leads to failures in the CloudManager pipelines.
The second example is a very prominent one, and I see it very often:
[Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepository service registration aborted java.lang.RuntimeException: Failed to set ACL (java.lang.UnsupportedOperationException: This builder is read-only.) AclLine DENY {paths=[/libs/cq/core/content/tools], privileges=[jcr:read]}
at org.apache.sling.jcr.repoinit.impl.AclVisitor.setAcl(AclVisitor.java:85)
It's the well-known "This builder is read-only" error. To understand the problem and its resolution, I need to explain a bit how the build process assembles AEM images in the CloudManager pipeline.
In AEM as a Cloud Service you have an immutable part of the repository, which consists of the trees "/libs" and "/apps". They are immutable because they cannot be modified at runtime, not even with admin permissions.
This immutable part of the image is built at build time. The process merges product-side parts (/libs) and custom application parts (/apps) together. After that all repoinit scripts run, both the ones provided by the product and any custom ones. During this part of the build these trees are still writable, so writing into /apps using repoinit is not a problem.
So why do you actually get this exception if /libs and /apps are writable at that point? Because repoinit is executed a second time: during the "final" startup, when /apps and /libs are immutable.
Repoinit is designed around the idea that all operations are idempotent. This means that if you want to create an ACL on /apps/myapp/foo/bar, the repoinit statement is a no-op if that specific ACL already exists. A second run of repoinit will do nothing, but find everything still in place.
But if the second run has to execute this action again, it is not a no-op anymore, which means that the ACL (or whatever the goal of that repoinit statement was) is no longer in place as expected.
And there is only one reason why this can happen: some other action between these two executions of repoinit changed the repository. The only other thing which modifies the repository is the installation of content packages.
Let’s illustrate this problem with an example. Imagine you have this repoinit script:
create path /apps/myapp/foo/bar
set ACL on /apps/myapp/foo/bar
  allow jcr:read for mygroup
end
And you have a content package which comes with content for /apps/myapp, whose filter is set to "overwrite", and which does not contain this ACL.
In this case the operations leading to this error are these:
During the image build, repoinit sets the ACL on /apps/myapp/foo/bar
the deployment overwrites /apps/myapp with the content package, so the ACL is wiped
AEM starts up
Repoinit wants to set the ACL on /apps/myapp/foo/bar, which is now immutable. It fails and breaks your deployment.
The solution to this problem is simple: you need to adjust the repoinit statements and the package definitions (especially the filter definitions) in such a way that the package installation does not wipe or overwrite any structure created by repoinit. And by "structure" I do not mean only nodes, but also nodetypes, properties etc. All must be identical, and in the best case they don't interfere at all.
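For illustration, a hypothetical filter.xml for the content package from the example above. Whether FileVault's "merge" import mode is appropriate depends on what else the package needs to update; treat this as a sketch, not a recipe:

<workspaceFilter version="1.0">
    <!-- "merge" only adds new nodes and leaves existing ones
         (such as the ACL created by repoinit) untouched -->
    <filter root="/apps/myapp" mode="merge"/>
</workspaceFilter>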
It is hard to validate this locally, as you don't have an immutable /apps and /libs there, but there is a test approach which comes very close to it:
Run all your repoinit statements in your local test environment, install all your content packages, and then trigger a second run of the repoinit statements (for example by restarting the instance).
During this second run of the repoinit statements you should not see any write operation in the trace log. If you do, it's a sign that your packages overwrite structures created by repoinit. You should fix these asap, because they will later break your CloudManager pipeline.
With this information at hand you should be able to troubleshoot any repoinit problems already on your local test environment, avoiding pipeline failures because of it.
Sling events are used for many aspects of the system, and initially JCR changes were delivered with them as well. But OSGi eventing (which Sling events are built on top of) is not designed for huge volumes of events (thousands per second), a situation which can happen in AEM; and one of the most compelling reasons to move away from this approach is that all these event handlers (both resource change events and all others) share a single thread pool.
For that reason the ResourceChangeListeners have been introduced. Here each listener declares in detail which changes it is interested in (restricted by path and type of the change); therefore Sling is able to optimise the listeners on the JCR level and does not listen for changes no one is interested in. This can reduce the load on the system and improve performance. For this reason the use of OSGi event handlers for resource changes is deprecated (although they still work as expected).
How can I find all the ResourceChangeEventListeners in my codebase?
That’s easy, because on startup for each of these ResourceChangeEventListeners you will find a WARN message in the logs like this:
Found OSGi Event Handler for deprecated resource bridge: com.acme.myservice
This will help you to identify all these listeners.
How do I rewrite them to ResourceChangeListeners?
In the majority of cases this should be straightforward: make your service implement the ResourceChangeListener interface and provide the additional OSGi properties:
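A minimal sketch of such a listener (class name and paths are made up; the interface and its property constants are from org.apache.sling.api.resource.observation):

import java.util.List;

import org.apache.sling.api.resource.observation.ResourceChange;
import org.apache.sling.api.resource.observation.ResourceChangeListener;
import org.osgi.service.component.annotations.Component;

@Component(service = ResourceChangeListener.class, property = {
        ResourceChangeListener.PATHS + "=/content/mysite",   // only listen below this path
        ResourceChangeListener.CHANGES + "=ADDED",           // only these types of changes
        ResourceChangeListener.CHANGES + "=CHANGED"
})
public class MySiteChangeListener implements ResourceChangeListener {

    @Override
    public void onChange(List<ResourceChange> changes) {
        // keep this method fast; offload expensive work to a Sling job if needed
        for (ResourceChange change : changes) {
            // e.g. react based on change.getType() and change.getPath()
        }
    }
}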
Error handling is a topic developers rarely pay much attention to. It is done when the API forces them to handle an exception. And the most common pattern I see is the "log and throw" pattern, which means that the exception is logged and then re-thrown.
When you develop in the context of HTTP requests, error handling can get tricky, because you need to signal to the consumer of the response that an error happened and the request was not successful. Frameworks are designed to handle any exception internally and set the correct error code if necessary. Sling is no different: if your code throws an exception (for example in the postConstruct of a Sling Model), the Sling framework catches it and sets the correct status code 500 (Internal Server Error).
I've seen code which catches exceptions itself and sets the status code on the response on its own. But this is not the right approach, because with every exception handled this way the developer implicitly states: "These are my exceptions and I know best how to handle them"; almost as if the developer takes ownership of these exceptions and their root causes, and as if there's nothing which could handle this situation better.
Handling exceptions entirely on your own like this is not best practice, and I see two problems with it:
Setting the status code alone is not enough; the remaining parts of the request processing need to be stopped as well. Otherwise the processing continues as if nothing had happened, which is normally not useful or even allowed. It's hard to ensure this when the exception is caught.
Owning the exception handling removes the responsibility from others. In AEM as a Cloud Service Adobe monitors response codes and the exceptions causing them. If there's only a status code 500 but no exception reaching the SlingMainServlet, it's likely to be ignored, because the developer claimed ownership of the exception (handling).
If you write a Sling servlet or code operating in the context of a request, it is best practice not to catch exceptions, but to let them bubble up to the Sling Main Servlet, which is able to handle them appropriately. Handle exceptions yourself only if you have a better way to deal with them than just logging them.
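To illustrate the difference, a small sketch (servlet registration and business logic are made up); the commented-out variant swallows the problem, the active one lets Sling handle it:

import java.io.IOException;

import javax.servlet.ServletException;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.PersistenceException;
import org.apache.sling.api.servlets.SlingAllMethodsServlet;

public class MyUpdateServlet extends SlingAllMethodsServlet {

    @Override
    protected void doPost(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws ServletException, IOException {
        try {
            updateRepository(request); // hypothetical business logic
        } catch (PersistenceException e) {
            // Anti-pattern: claim ownership and let processing continue
            //   log.error("update failed", e);
            //   response.setStatus(500);

            // Better: let it bubble up; Sling sets the 500 status and the
            // exception stays visible to monitoring
            throw new ServletException("update failed", e);
        }
    }

    private void updateRepository(SlingHttpServletRequest request) throws PersistenceException {
        // ... modify resources and commit
    }
}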
Every now and then I see the question: "We get the TooManyCallsException while rendering pages, and we need to increase the threshold for the number of inclusions to 5000. Is this a problem? What can we do so we don't run into this issue at all?"
Before I answer this question, I want to explain the background of this setting, why it was introduced and when such a “Call” is made.
Sling rendering is based on servlets; and while a single servlet can handle the rendering of the complete response body, that is not that common in AEM. AEM pages normally consist of a variety of different components, which internally can consist of distinct subcomponents as well. This depends on the design approach the development team has chosen. (It should be mentioned that all JSPs and all HTL scripts are compiled into regular Java servlets.)
That means that the rendering process can be considered a tree of servlets, with servlets calling other servlets (and the DefaultGetServlet being the root of such a tree when rendering pages). This tree is structured along the resource tree of the page, but it can include servlets which render content from different areas of the repository, for example when dealing with content fragments or including images, which require their metadata to be respected.
It is possible to turn this tree into a cyclic graph, and that means that the process of traversing this tree of servlets turns into a recursion. In that case request processing will never terminate, the Jetty thread pool will quickly fill up to its limit, and the system will become unavailable. To avoid this situation only a limited number of servlet calls per request is allowed. And that's this magic number of 1000 allowed calls (which is configured in the Sling Main Servlet).
Knowing this, let me try to answer the question "Is it safe to increase this value of 1000 to 5000?". Yes, it is mostly safe; in case your page rendering goes recursive it just terminates later, which slightly increases the risk of your AEM instance becoming unavailable.
"Are there any drawbacks? Why is the default 1000 and not 5000 (or 10000 or any higher value)?" From experience 1000 is sufficient for the majority of applications. It might be too low for applications whose components are designed in a very granular way, which in turn requires a lot of servlet calls to properly render a page. And every servlet call comes with a small overhead (mostly for running the component-level filters); even if this overhead is just 100 microseconds, 1000 invocations add up to 100 ms just for the invocation overhead. That means you should find a good balance between a clean application modularization and its runtime performance overhead.
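If you do decide to change the limit, it is an OSGi configuration of the Sling Main Servlet. A sketch of such a configuration file (PID and property name are taken from the Sling engine metatype; verify them against your version, and check whether your environment permits this change):

org.apache.sling.engine.impl.SlingMainServlet.cfg.json:
{
  "sling.max.calls": 5000
}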
Which leads to the next question: "What are the problematic calls we should think of?". Good one. From a high-level view of AEM page rendering you cannot avoid the servlet calls which render the components. That means that you as an AEM application developer cannot influence the overall page rendering process; you can only try to optimise the rendering of individual (custom) components. To do that, you should be aware that the following things trigger the invocation of a servlet during page rendering:
the <cq:include>, <sling:include> and <sling:forward> JSP tags
the data-sly-include statement of HTL
and every method which directly or indirectly invokes the service() method of a servlet.
A good way to check this for some pages is the "Recent requests" functionality of the OSGi webconsole.
For many large websites, CDNs are the foundation for delivering content quickly to customers around the world. The ability of CDNs to cache responses close to consumers also allows these sites to operate on a small hardware footprint, compared to what they would have to invest if they operated without a CDN and delivered all content through their own systems. However, this comes at a cost: your CDN may now deliver content that is out of sync with your origin because you changed the content on your own system, and this change does not propagate in an atomic fashion ("atomic" as in the ACID principle of database implementations). This is a conscious decision, and it is caused primarily by the CAP theorem. It states that in a distributed data storage system, you can only achieve 2 of these 3 guarantees:
Consistency
Availability
Partition tolerance
And in the case of a CDN (which is a highly distributed data storage system), its developers usually opt for availability and partition tolerance over consistency. That is, they accept delivering content that is out of date because the originating system has already updated it.
The HTTP protocol has features built in which help to mitigate this problem at least partially. Check out the latest RFC draft on HTTP caching, it is a really good read. The main feature is the "TTL" (time-to-live), which means that the CDN delivers a cached version of the content only for a configured time; afterwards the CDN fetches a new version from the origin system. The technical term for this is "eventually consistent", because at that point the state of the system with respect to that content is consistent again.
This is the approach all CDNs support, and it works very reliably, but only if you accept that content changed on the origin system reaches your consumers with this delay. The delay is usually set to a period of time that is empirically determined by the website operators, trying to balance the need to deliver fresh content (which requires a very low or no TTL) against the number of requests that the CDN can answer instead of the origin system (for which the TTL should be as high as possible). Usually it is in the range of a few minutes.
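In HTTP such a TTL is typically expressed via the Cache-Control response header; for example, a 5-minute TTL for shared caches (like a CDN) combined with a shorter one for browsers could look like this:

Cache-Control: max-age=60, s-maxage=300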
(Even if you don't use a CDN in front of your origin systems, you need these caching instructions, because otherwise browsers will make assumptions and cache the requested files on their own. Browsing the web without caching is slow, even on very fast connections, not to mention what happens when using a mobile device on a slow 3G line. Eventual consistency is an issue you can't avoid when working on the web.)
Caching is an issue you will always have to deal with when creating web presences. Try to cache as much as possible, without neglecting the need to refresh or update content at arbitrary times.
You need to constantly address eventual consistency. Atomic changes (meaning changes that are immediately visible to all consumers) are possible, but they come at a price: you can't use CDNs for this content, but must deliver it all directly from your origin system. In this case you need to design your origin system so that it can work without eventual consistency at all (and eventual consistency is built into many systems), not to mention the additional load it will have to handle.
And for this reason I would always recommend not relying on atomic updates or consistency across your web presence. Always factor eventual consistency into the delivery of your content. In most cases even business requirements demanding "immediate updates" can be satisfied with a TTL of 1 minute. Still not "immediate", but good enough in 99% of all cases. For the remaining 1% where consistency is mandatory (e.g. real-time stock trading) you need to find a different solution, and I am not sure if the web is always the right technology then.
And as an afterthought regarding TTL: of course many CDNs offer you the chance to actively invalidate content, but this also comes at a price. In many cases you can only invalidate single files. Often it is not an immediate action, but takes seconds up to many minutes. And the price is always that you must have the capacity to handle the load when the CDN needs to refetch a larger chunk of content from your origin system.
During some recent work on performance improvements in request processing I used a tool which has been part of AEM for a very long time now; I cannot recall a time when it was NOT there. It's very simple, but nevertheless powerful, and it can help you understand the processing of requests in AEM much better.
I am talking about the "Recent Requests Console" in the OSGi webconsole, which is a gem in the "AEM performance tuning" toolbox.
In this blog post I use this tool to explain the details of the request rendering process of AEM. You can find the detailed description of this process in the pages linked from this page (Sling documentation).
Screenshot “Recent requests”
On this Recent Requests screen (go to /system/console/requests) you can drill down into the rendering process of the last 20 requests handled by this AEM instance; they are listed at the top of the screen. Be aware that if you have a lot of concurrent requests you might often miss the request you are looking for; if you really rely on it, you should increase the number of requests which are retained. This can be done via the OSGi configuration of the Sling Main Servlet.
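A sketch of such a configuration (property name as found in the Sling Main Servlet metatype; verify it against your version):

org.apache.sling.engine.impl.SlingMainServlet.cfg.json:
{
  "sling.max.record.requests": 100
}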
When you have opened a request, you will see a huge number of individual log entries. Each log entry starts with a timestamp (in microseconds, 1000 microseconds = 1 millisecond) relative to the start of the request. With this information you can easily calculate how much time passed between two entries.
Each request has a typical structure, so let's go through it using the AEM Start page (/aem/start.html). Just request that page in a different browser window, then check back on the "Recent Requests" console and select the "start.html" entry. In the following I will go through the lines, starting from the top.
Here we see the 2 log lines for the resolution of the resource type. It took 1932 microseconds to map the request "/aem/start.html" to the resourcetype "granite/core/components/login", with the path being /libs/granite/ui/content/shell/start. Additionally we see information about the selector, extension and suffix elements.
4923 TIMER_START{ServletResolution}
4925 TIMER_START{resolveServlet(/libs/granite/ui/content/shell/start)}
4941 TIMER_END{14,resolveServlet(/libs/granite/ui/content/shell/start)} Using servlet BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)
4945 TIMER_END{21,ServletResolution} URI=/aem/start.html handled by Servlet=BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)
That's a nested servlet resolution, which takes 14 and 21 microseconds respectively. Up to here everything is mostly standard and hard to influence performance-wise. But it already gives you a lot of information, especially regarding the resourcetype which is managing the complete response processing.
These are all request-level filters, which are executed just once per request.
And now the interesting part starts: the rendering of the page itself. The building blocks are called "components" (that term is probably familiar to you) and their rendering always follows the same pattern:
Calling Component Filters
Executing the Component
Return from the Component Filters (in reverse order of the calling)
This pattern can be clearly seen in the output, but most often it is more complicated because many components include other components, and so you end up in a tree of components being rendered.
As an example for the straightforward case we can take the "head" component of the page:
At the top you see the LOG statement "Including resource …", which tells you which resource is rendered, including additional information like selector, extension and suffix.
The next statement is the resolution of the render script which is used to render this resource, plus the time it took (40 microseconds).
Then we have the invocation of all component filters, the execution of the render script itself, which uses a TIMER to record start time, end time and duration (18576 microseconds), and the unwinding of the component filters.
If you use a recent version of the SDK for AEM as a Cloud Service, all timestamps are in microseconds, but in AEM 6.5 and older the durations measured for the filters (inner=…, outer=…) were printed in milliseconds (an inconsistency I fixed just recently).
If a component includes another component, it looks like this:
You see the component filters, but then, after the TIMER_START line for the page.jsp (check the trailing timer number #0; every timer has a unique ID!), you see the inclusion of a new resource. For this resource the render script is again resolved, and instead of the ComponentFilters the IncludeFilters are called, though in the majority of cases the lists of filters are identical. Depending on the resource structure and the scripts, the rendering tree can get really deep. But eventually you can see that the rendering of the page.jsp is completed; you can easily find it by looking for the respective timer ID.
Equipped with this knowledge you can now easily dig into the page rendering process and see which resources and resource types are part of the rendering of a page. And if you are interested in the bottlenecks of the page rendering process, you can check the TIMER_END lines, which include both the render script and the time in microseconds it took to render it (be aware that this time also includes the time to render all scripts invoked from this render script).
But the really cool part is that this is extensible. Via the RequestProgressTracker you can easily write your own LOG statements, start timers etc. So if you want to debug requests to better understand the timing, you can easily use something like this:
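A minimal sketch (timer and message names are made up; the API is org.apache.sling.api.request.RequestProgressTracker, obtained from the SlingHttpServletRequest):

RequestProgressTracker tracker = slingRequest.getRequestProgressTracker();
tracker.startTimer("myExpensiveLookup");                    // shows up as TIMER_START{myExpensiveLookup}
Object result = doExpensiveLookup();                        // hypothetical business logic
tracker.logTimer("myExpensiveLookup");                      // shows up as TIMER_END{<duration>,myExpensiveLookup}
tracker.log("MyComponent: lookup returned {0} items", 42);  // a plain LOG entry, MessageFormat style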
You can then find these log messages in this screen whenever the component is rendered. You can use them to output useful (debugging) information or just use their timestamps to identify performance problems. This can be superior to normal logging (to a logfile), because you can leave these statements in production code without polluting the log files. You just need access to the OSGi webconsole: search for the request you are interested in and check the rendering process.
And if you are interested, you can also get all entries recorded in this screen programmatically and do whatever you like with them. For example you can write a (request-level) filter which first calls the next filter in the chain, and afterwards logs all entries of the RequestProgressTracker to the logfile if the request processing took more than 1 second; see the sketch below.
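A sketch of such a filter (class name and threshold are made up; EngineConstants and RequestProgressTracker are existing Sling APIs):

import java.io.IOException;
import java.util.Iterator;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.request.RequestProgressTracker;
import org.apache.sling.engine.EngineConstants;
import org.osgi.service.component.annotations.Component;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Component(service = Filter.class,
        property = EngineConstants.SLING_FILTER_SCOPE + "=" + EngineConstants.FILTER_SCOPE_REQUEST)
public class SlowRequestLoggingFilter implements Filter {

    private static final Logger LOG = LoggerFactory.getLogger(SlowRequestLoggingFilter.class);
    private static final long THRESHOLD_MS = 1000;

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long start = System.currentTimeMillis();
        try {
            chain.doFilter(request, response); // run the actual request first
        } finally {
            long duration = System.currentTimeMillis() - start;
            if (duration > THRESHOLD_MS && request instanceof SlingHttpServletRequest) {
                RequestProgressTracker tracker =
                        ((SlingHttpServletRequest) request).getRequestProgressTracker();
                LOG.warn("Request took {} ms, dumping progress tracker:", duration);
                Iterator<String> messages = tracker.getMessages();
                while (messages.hasNext()) {
                    LOG.warn(messages.next());
                }
            }
        }
    }

    @Override
    public void init(FilterConfig filterConfig) { /* nothing to do */ }

    @Override
    public void destroy() { /* nothing to do */ }
}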
The RequestProgressTracker plus the "Recent Requests" screen of the OSGi webconsole are a really cool combination, both to help you understand the inner workings of Sling request processing and to analyze and understand the performance of request processing.
I hope that this technical deep dive into the Sling page rendering process was helpful, and that you are able to spot many interesting aspects of an AEM system just by using this tool. If you have questions, please leave me a comment below.
This time I want to discuss a different type of micro-optimization. It’s not something you as a developer can implement in your code, but it’s rather a question of the application design, which has some surprising impact. I came across it when I recently investigated poor performance in the Siteadmin navigation. And although I did this investigation in AEM as a Cloud Service, the logic on AEM 6.5 behaves the same way.
When you click through your pages in the siteadmin navigation, AEM collects a lot of information about pages and folders to display them in the proper context. For example, when you click on a page with child pages, it collects information about which actions should be displayed if a specific child node gets selected (copy, paste, publish, …).
One important piece of information is whether the "Create page" action should be made available. And that's what I want to outline in this article.
Screenshot: “Create” dialog
Assuming that you have the required write permissions on that folder, the most important question is whether any template is allowed to be used for a new child page of the current page. The logic is described in the documentation and is quite complex.
In short:
On the content side the template must be allowed (via the cq:allowedTemplates property, if present) AND
The template must be allowed to be used as a child page of the current page
Both conditions must be met for a template to be eligible as the source for a new page. To display the entry "Page" it's sufficient if at least one template is allowed.
Now let's think about the runtime performance of this check, which is mostly determined by the total number of templates in the system. AEM determines all templates with this JCR query:
//jcr:content/element(*,cq:Template)
And that query returns 92 results on my local SDK instance with WKND installed. If we look a bit more closely at the results, we can distinguish 3 different types of templates:
Static templates
Editable templates
Content Fragment models
So depending on your use case it's easy to end up with hundreds of templates, and not all of them are applicable at the location you are currently in. In fact, typically just a very few templates can be used to create a page there. That means that the check most likely needs to iterate over a lot of templates before it eventually encounters a match.
Let's come back to the evaluation whether that entry should be displayed. If you have defined the cq:allowedTemplates property on the page or one of its ancestors, it's sufficient to check the templates listed there. Typically that's just a handful of templates, and it's very likely that you find a "hit" early on, which immediately terminates the check with a positive result. I want to explicitly mention that not every template listed there can actually be used, because there are also other constraints (e.g. the template of the parent page must be of a certain type etc.) which must match.
If template A is allowed to be used below /content/wknd/en, then we just need to check this single template A to get the hit. We don't care where it is in the list of templates returned by the above query, because we know exactly which one(s) to look at.
If that property is not present, AEM needs to go through all templates and check the conditions for each and every one until it finds a positive result. The order of this list is identical to the order in which the templates are returned by the JCR query, which means the order is not deterministic. It is also not possible to order the result in a helpful way, because the semantics of our check (which includes regular expressions) cannot be expressed as part of the JCR query.
So you are very lucky if the JCR query returns a matching template already at position 1 of the list, but that's very unlikely. Typically you need to iterate over tens of templates to get a hit.
So, what's the impact of this iteration and these checks on performance? In a synthetic test with 200 templates and no match at all, it took around 3-5 ms to iterate over and check all of the results.
You might say "I really don't feel a 3-5 ms delay", but when the list view in siteadmin performs this check for up to 40 pages in a single request, it adds up to a 120-200 millisecond difference. And that is a significant delay for requests where bad performance is immediately visible, especially since there's a simple way to mitigate it.
And for that reason I recommend providing cq:allowedTemplates properties in your content structure. In many cases it's possible, and it will speed up the siteadmin navigation.
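For illustration, such a property could look like this on the site root (the paths follow the WKND example and are purely illustrative; the values are regular expressions matched against template paths):

/content/wknd/en/jcr:content
    cq:allowedTemplates = [/conf/wknd/settings/wcm/templates/article-page-template,
                           /conf/wknd/settings/wcm/templates/content-page-template]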
And for those who cannot change that: I am currently working on changing the logic to speed up the processing for the cases where no cq:allowedTemplates property is applicable. If you are on AEM as a Cloud Service, you'll get this improvement automatically.
Welcome to my third post on AEM micro-optimizations, again with some interesting ways to improve your AEM application performance, sometimes with small improvements, sometimes with significant ones.
During some recent performance optimization I came across code which felt a bit odd. Technically it was quite simple:
for (Item item : manyItems) {
    processSingleItem(resolver, item);
}

void processSingleItem(ResourceResolver resolver, Item item) throws PersistenceException {
    // do something with the resourceResolver
    resolver.commit();
}
That is indeed a very common pattern, especially in software which evolved over time: you have code which deals with a single item, and later, when you need to do the same for multiple items, you execute this code in a loop. Works perfectly, and the pattern is widely used.
And it can be problematic.
But what if there is an operation in that processSingleItem() method which comes with a static overhead? Maybe you are not aware of that overhead, so it goes unnoticed. Maybe you expect that if processSingleItem() takes 5 ms for one item, requiring 50 ms for 10 items is OK. Well, an O(n) algorithm isn't too bad, is it?
But what if I tell you that the static overhead of that method is so large that processing 10 items instead of just one increases its runtime not by a factor of 10, but only by a factor of 1.1?
Imagine you need to go grocery shopping for your Sunday dinner. You get yourself ready, take the bike to the grocery store, get the potatoes you need. Pay, and get back home. Drop the potatoes there. Then again, taking the bike to the grocery store, getting some meat. Back home. Again to the grocery store, this time for paprika (grilled paprika is delicious …). And so on and so on, until you have everything you need for your barbecue on Sunday. You have now spent 6 hours, mostly on the bike and waiting at the counter.
Are you doing that? No, of course not. You ride once to the grocery store, get all the things, pack them onto your bike, and get home. Takes maybe 90 minutes. Having the static overhead (cycling, waiting at the counter) just once saves a lot of time.
It's the same in coding. You have static overhead (acquiring locks, getting database connections, network latency, calling through thick framework layers which just copy references to the data) which is not determined by the amount of data you process. But unlike in the grocery shopping example it's not directly visible where such static overhead occurs, and unfortunately documentation rarely points it out.
Writing to the repository comes with such a static overhead, and it can be like a 20-minute ride to the grocery store. Saving 10 times in small batches definitely takes more time than saving once with a batch of 10 times the size, at least if you keep the size of the changeset limited; for details check this earlier posting of mine.
Check out this great presentation by Georg Henzler at adaptTo() 2019 (starting at 17:00 min) (slides) for some benchmark data on how the size of the changeset influences the time to save (spoiler: for realistic sizes it does not really increase).
So I changed the above code to something like this:
for (Item item : manyItems) {
    processSingleItem(resolver, item);
}
resolver.commit();

void processSingleItem(ResourceResolver resolver, Item item) {
    // do something with the resourceResolver, but no commit
}
Switching to this approach improved the performance for ~100 items by a factor of more than 10! That's an impressive number for such a minimal change.
So check your code for this specific pattern, find out if the prerequisites are met (that means small individual changes), and add some performance logging. Then convert to this batching mode and see what your numbers are doing.
Of course, very often this saving happens in the context of a much larger operation, and a 10 times improvement in this area might only speed up the larger operation from 12 seconds to 11 seconds. But hey, when you get this 1 second almost for free, just take it (we are still talking about micro-optimizations). And nothing prevents you from taking a deeper look into what the system is doing in the remaining 11 seconds.
Leave me a comment if you have some interesting story to share, where such small changes resulted in big improvements.
Micro-optimizations are important, and their importance is well described by an LWN posting about the Linux kernel:
Most users are unlikely to notice any amazing speed improvements resulting from these changes. But they are an important part of the ongoing effort to optimize the kernel’s behavior wherever possible; a long list of changes like this is the reason why Linux performs as well as it does.
And this is not specific to the Linux kernel; you can apply the same strategy to every piece of software. The very same holds for AEM, a complex (and admittedly sometimes really slow) beast.
There are a number of cases in AEM where you operate not just on single objects (pages, assets, resources, nodes), but apply the same operation to many of these objects.
The naive approach of just iterating over the list and executing the operation on each single element can be quite inefficient, especially if this operation comes with a static overhead.
Some examples:
For replication there are some pre-checks, then the creation of the package, the creation of the Sling jobs (or sending the package to the pipeline when running on AEM as a Cloud Service), the update of the replication status, and writing the audit log entries.
When determining the replication status of a page, the replication queues need to be checked whether this page is still subject to a pending replication, which can get slow when the queues are full.
Committing changes to the JCR repository; there is a certain overhead in it (validating all changes, committing them to permanent storage, invoking the synchronous listeners, locking etc.).
And in many cases these bottlenecks have been known for a while, and there is an API which allows you to perform the action in batch mode for a multitude of elements:
Replication: Batch replication (you can provide a number of path strings)
The ReplicationStatusProvider was introduced some years back when we had to deal with large workflow packages being replicated, which resulted in a lot of traversals of the replication queue entries. Adding this optimized version improved the performance by at least a factor of 10; even in less intense operations I expect an improvement.
So if you have a hand-crafted loop to execute a certain activity on many elements, check if a more efficient batch API is available. There’s a good chance that it is already there.
If you have more cases where a batch mode should be available but isn't, leave a comment here. I am happy to help you either find the right API or potentially kickstart a product improvement.
As a followup on the previous article I want to show you what a micro-optimization can look like. My colleague Miroslav Smiljanic found that there is a significant difference in the time it takes to execute statements (1) and (2).
Is this change important? By itself it is not, because going up the resource/node tree is not that common compared to going down the tree. So replacing a single call might yield only an improvement of a fraction of a millisecond, even though case (2) is up to 200 times faster than case (1)!
But if we can replace the code in all cases where the slower variant can be substituted by the performant getParent() call, especially in the low-level areas of AEM and Sling, all areas might benefit from it. And then we don't execute it just once per page rendering, but maybe a hundred times, and we might already end up with tens of milliseconds of improvement, for every request!
And in special use cases the effect can be even bigger (for example if your code constantly traverses the tree upwards).
Another example of such a micro-optimization, which is normally quite insignificant but can yield huge benefits in special cases, can be found in SLING-10269, where I found that built-in caching of the isResourceType() result reduces the rendering times of some special requests by 50%, because it is invoked thousands of times.
Typically micro-optimizations have these properties:
In the general case the improvement is barely visible (< 1% improvement of performance)
In edge cases they can be a life saver, because they reduce execution time by a much larger percentage.
The interesting part is that these improvements accumulate over time. When you have implemented 10 of these in low-level routines, chances are high that your use case benefits from them as well. Maybe by 10 times 0.5% performance improvement, but maybe also by 20%, because you hit the sweet spot of one of them.
So it is definitely worth paying attention to these improvements.
My recommendation for you: read the entry in the Oak "Do's and Don'ts" page and try to apply this learning to your codebase. And if you find more such cases in the Sling codebase, the community appreciates a ticket.