My top 3 reasons why page rendering is slow

Over the past years I have been engaged in many performance tuning activities, mostly related to slow page rendering on AEM publish instances. Performance tuning on the authoring side is often different and definitely much harder :-/

Over time I have identified 3 main types of issues which make page rendering slow. Slow page rendering can be hidden by caching, but at some point the page needs to be rendered, and it often makes a difference whether this process takes 800ms or 5 seconds. Okay, so let's start.

Too many components

This is a pattern I often see in older codebases: pages are assembled out of 100+ components, very often deeply nested. My personal record is 400 components, nested 10 levels deep. This normally causes problems in the authoring UI, because you need to be very careful to select the correct component rather than its parent or a child container.

The problem for the page rendering process is the overhead of each component. This overhead consists of the actual include logic plus all the component-level filters. While a single inclusion does not take much time, the large number of components adds up.

For that reason: please, please reduce the number of components on your page. Not only the backend rendering time, but also the frontend performance (fewer JavaScript and CSS rules to evaluate) and the authoring experience will benefit from it.

Slow Sling models

I love Sling Models. But they can also hide a lot of performance problems (see my series about optimizing Sling Models), and thus can be a root cause of performance problems. In the context of page rendering and Sling Models backing HTL scripts, the problem is normally not the annotations (see this post), but rather the complex and time-consuming logic executed when the models are instantiated, most specifically executing the same logic multiple times (as described in my earlier post “Sling Model Performance (Part 4)“).

External network connections

In this pattern a synchronous call is made to a different system during page rendering; while this request is executed, the rendering thread on the AEM side is blocked. This turns into a problem if the backend is either slow or unavailable. Unfortunately this is the hardest case to fix, because removing it often requires a re-design of the application. Please also see my post “Do not use AEM as a proxy for backend calls”; it contains a few recommendations on how to avoid at least some of the worst aspects, for example by using proper timeouts.

Sling model performance (part 4)

I think it’s time for another chapter on the topic of Sling Model performance, just to document some interesting findings I recently made in the context of a customer project. If you haven’t read them yet, I recommend checking out the first 3 parts of this series here:

In this blog post I want to show the impact of inheritance in combination with Sling Models.

Sling Models are simple Java POJOs, and for that reason all features of Java can be used safely. I have seen many projects where these POJOs inherit from a more or less sophisticated class hierarchy, which often reflects the component hierarchy. These parent classes often also consolidate generic functionality used in many or all Sling Models.

For example, many Sling Models need to know the site-root page, because starting from there they build links and the navigation, read global properties, etc. For that reason I have seen code like this in many parent classes:

public class AbstractModel {

  protected Page siteRoot;

  public void init() {
    siteRoot = getSiteRoot();
    // and many more initializations
  }

  protected Page getSiteRoot() {
    // walks up the content tree to find the site-root page
    ...
  }
}

And then it is used like this in a Sling Model called ComponentModel:

public class ComponentModel extends AbstractModel {

  @PostConstruct
  public void init() {
    super.init();
  }
  ...
}

That’s all straightforward and good. But only until 10 other Sling Models also inherit from AbstractModel, and all of them also invoke the getSiteRoot() method, which in all cases returns a page object representing the same resource in the repository. Feels redundant, and it is. It’s especially redundant if a model invokes the init() method of its parent but does not actually need all of the values calculated there.

While in this case the overhead is probably small, I have seen cases where the removal of this redundant code brought down the rendering time from 15 seconds to less than 1 second! That’s significant!

For this reason I want to make some recommendations on how you can speed up your Sling Models when you use inheritance.

  • If you want or need to use inheritance, make sure that the parent class has a small and fast init method, so that it does not add too much overhead to the construction of each Sling Model.
  • I love Java lambdas in this case, because you can pass them around and only invoke them when you really need their value. That’s ideal for lazy evaluation.
  • And if you need a calculated value more than once, store it for later reuse.
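The lambda-based lazy evaluation can be sketched like this (a minimal illustration; the Lazy helper and all names in it are my own invention, not AEM API):

```java
import java.util.function.Supplier;

// Sketch of lazy, compute-at-most-once evaluation for expensive model
// initializations: the delegate supplier is only invoked on first use,
// and the result is cached for later reuse.
public class Lazy<T> implements Supplier<T> {

    private final Supplier<T> delegate;
    private T value;
    private boolean computed;

    private Lazy(Supplier<T> delegate) {
        this.delegate = delegate;
    }

    public static <T> Lazy<T> of(Supplier<T> delegate) {
        return new Lazy<>(delegate);
    }

    @Override
    public synchronized T get() {
        if (!computed) {
            value = delegate.get(); // the delegate runs at most once
            computed = true;
        }
        return value;
    }
}
```

A model could then declare something like `private final Lazy<Page> siteRoot = Lazy.of(this::computeSiteRoot);` in the parent class: models which never call siteRoot.get() never pay for the lookup, and those that call it twice pay only once.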

Monitoring Java heap

Every now and then I get the question: “What do you think if we alert at 90% heap usage of AEM?”. The answer is always longer, so I write it down here for easier linking.

TL;DR: Don’t alert on the amount of used heap, but only on garbage collection.

Java is a language which relies on garbage collection (GC). Unlike in some other programming languages, memory is managed by the runtime. The operator assigns a certain amount of RAM to the Java process, and that’s it. A large fraction of this RAM goes into the heap, and the Java Virtual Machine (JVM) manages this heap entirely on its own.

Now, like every good runtime, the JVM is lazy and does work only when it’s required. That means it will start the garbage collection only when the amount of free memory is low. This is probably over-simplified, but good enough for the purpose of this article.

That means the heap usage metric shows usage approaching 100%, then suddenly dropping to a much lower value, because the garbage collection process just released memory which is no longer required. Then the garbage collection pauses, processing goes on and consumes memory, until at some point the garbage collection starts again. This leads to the typical saw-tooth pattern of the JVM heap.

(source: Interesting Garbage Collection Patterns by Ram Lakshmanan)

For that reason it’s not helpful to use heap usage as an alerting metric: it fluctuates too much, and it will often alert you when the actual memory usage is already down again.

But of course there are situations where the saw-tooth pattern gets less visible, because the garbage collection can release less and less memory with each run; that can indeed point to a problem. How can this be measured?

In this scenario the garbage collection runs more frequently, and the less memory it releases, the more often it runs, until the entire application is effectively stopped and only the garbage collection is running. That means you can use the fraction of time the garbage collector runs per time period as your metric. Anything below 5% is good, and anything beyond 10% is a problem.
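As a sketch of how this metric can be obtained from a running JVM (using the standard JMX management beans; the class and method names are my own):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Reads the accumulated time (in milliseconds) all garbage collectors have
// spent collecting since JVM start. Sampling this periodically and computing
// the delta per interval yields the "% of time spent in GC" metric.
public class GcTimeProbe {

    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if undefined for this collector
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    // Percentage of an interval spent in GC, given two samples taken
    // intervalMillis apart.
    public static double gcPercentage(long previousMillis, long currentMillis, long intervalMillis) {
        return 100.0 * (currentMillis - previousMillis) / intervalMillis;
    }
}
```

Sampling every 60 seconds, a delta of 3000ms of GC time corresponds to 5%; alerting above 10% matches the rule of thumb above.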

For that reason, rather measure the garbage collection activity, as it is a better indicator of whether your heap is too small.

Delivering dynamic renditions

One of the early features of ACS AEM Commons was the Named Image Transformer, introduced with release 1.5 in 2014. This feature allowed you to transform image assets dynamically with a number of options, most notably the transformation into different image dimensions to match the requirements of the frontend guidelines. This feature was quite popular, and in a stripped-down scope (it does not support all features) it also made it into the WCM Core Components (as the AdaptiveImageServlet).

This feature is nice, but it suffers from a huge problem: the transformation is done dynamically on request, and depending on the image asset itself it can consume a huge amount of heap memory. The situation gets worse when many such requests are made in parallel, and more than once I have seen AEM publish instances end up in heavy garbage collection, ultimately leading to crashes and/or service outages.

This problem is not really new, as pretty much the same issue also happens at asset ingestion time, when the predefined renditions are created. On AEM 6.5 the standard solution was to externalize this work (hello ImageMagick!), while AEM CS solved the challenge in a different and more scalable way using Asset Compute. But both solutions do not address the problem of end-user requests for these dynamic renditions; this is still done on request in the heap.

We have implemented a number of improvements in the AdaptiveImageServlet to improve the situation:

  • A limit for requested dimensions was added to keep the memory consumption “reasonable”.
  • The original rendition is not necessarily used as the basis to render the image in the requested dimension; instead the closest rendition which can satisfy the requested parameters is used.
  • An already existing rendition is delivered directly if its dimensions and image format match the request.
  • An upcoming improvement for the AdaptiveImageServlet on AEM CS is to deliver these renditions directly from the blobstore instead of streaming the binary via the JVM.
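The “closest rendition” idea can be sketched like this (Rendition here is a simplified stand-in type, not the AEM Rendition API):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Among all renditions at least as wide as the requested width, pick the
// narrowest one, so the remaining downscaling work (and thus the heap
// consumption) is minimal.
public class RenditionPicker {

    public record Rendition(String name, int width) {}

    public static Optional<Rendition> closest(List<Rendition> renditions, int requestedWidth) {
        return renditions.stream()
                .filter(r -> r.width() >= requestedWidth)
                .min(Comparator.comparingInt(Rendition::width));
    }
}
```

If no rendition is large enough, the result is empty and the caller would have to fall back to the original binary (or upscale, depending on policy).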

This improves the situation already, but there are still customers and cases where images are resized dynamically. For these users I suggest making these changes:

  • Compile a list of all required image dimensions which you need in your frontend.
  • And then define matching processing profiles, so that whenever such a rendition is requested via the AdaptiveImageServlet it can be served directly from an existing rendition.

That works without changes in your codebase and will improve the delivery of such assets.

And to the users of the Named Image Transformer of ACS AEM Commons I suggest rethinking its usage. Do you really use all of its features?

Restoring deleted content

I just wrote about backup and restore in AEM CS, and why backups cannot serve as a replacement for an archival solution; instead they are designed as a precaution against major data loss and corruption.

But there is another aspect to that question: what about deleted content? Is requesting a restore the proper way to handle these cases?

Assume that you have accidentally deleted an entire subtree of pages in your AEM instance. From a functional point of view you can perform a restore to a point in time before this deletion. But that means a rollback of the entire content: not only is the deleted content restored, but all other changes performed since that time are undone as well.

Depending on the frequency of activities and the time needed to restore, this can be a lot, and you would need to perform all these changes again to catch up.

The easiest way to handle such cases is to use the versioning features of AEM. Many activities trigger the creation of a version of a page, for example when you activate it or when you delete it via the UI; you can also trigger the creation of a version manually. To restore one page or even an entire subtree you can use the “Restore” and “Restore Tree” features of AEM (see the documentation).


In earlier versions of AEM, versions were not created for assets by default, but this has changed in AEM CS; now versions are created for assets pretty much as they are created for pages. That means you can use the same approach and restore versions of assets via the timeline (see the documentation).

With proper versioning in place, most if not all such accidental deletions or changes can be handled. This is the preferred approach, because it can be executed by regular users and does not impact the rest of the system by rolling back all changes. And you don’t have any downtime on authoring instances.

For that reason I recommend working with these features as much as possible. But there are situations where the impact is so severe that you would rather roll back everything than restore things through the UI. In those situations a restore is probably the better solution.

AEM CS Backup, Restores and Archival

One recurring question I see in the Adobe-internal communication channels goes like this: “For our customer X we need to know how long Adobe stores backups for our CS instances”.

The obvious answer to this is “7 days” (see the documentation) or “3 months” (for Offsite backup), because the backup is designed only to handle cases of data corruption of the repository. But in most cases there is a follow-up question: “But we need access to backup data for up to 5 years”. Then it’s clear that this question is not about backup, but rather about content archival and compliance. And that’s a totally different question.

TL;DR

When you need to retain content for compliance reasons, my colleagues are happy to discuss the details with you. But increasing the retention period for your backups is not a solution for it.

Compliance

So what does “content archival and compliance” mean in this situation? For regulatory and legal reasons some industries are required to retain all public statements (including websites) for some time (normally 5-10 years). Of course the implementation of that is up to the company itself, and it seems quite easy to implement an approach which keeps the backups around for these 10 years.

Some years back I spent some time at the drawing board to design a solution for an AEM on-prem customer; their requirement was to be able to prove what was displayed to customers on their website at any point within these 10 years.
We initially also thought about keeping backups around for 10 years, but then we came up with these questions:

  • When the content is required, a restore from that backup would be needed into an environment which can host this AEM instance. Is such an environment (servers, virtual machines) available? How many of these environments would be required, assuming that such an instance would need to run for some months (throughout the entire legal process which requires content from that time)?
  • Assuming that an 8-year-old backup must be restored, are the old virtual machine images with Red Hat Linux 7 (or whatever OS) still around? Is it okay from a compliance perspective to run these old and potentially unsupported OS versions even in a secured network environment? Is the documentation still around which describes how to install all of that? Does your backup system still support a restore to such an old OS version?
  • How would you authenticate against such an old AEM version? Would you require your users to have their old passwords at hand (if you authenticate against AEM), or does your central identity management still support the interface this old AEM version uses for authentication?
  • As this is a web page, is it ensured that all external references embedded into the page are also still available? Think about the JavaScript and CSS libraries which are often just pulled from their respective CDN servers.
  • How frequently must a backup be stored? Is it okay and possible to store just the authoring instance every quarter, not perform any cleanup (version cleanup, workflow purge, …) in that time, and have all content changes versioned, so you can use the restore functionality to go back to the requested time? Or do you need to store a backup after each deployment, because each deployment can change the UI and introduce backwards-incompatible changes which would render the restored content non-functional? Would you need to archive the publish instance as well (where normally no versions are preserved)? And are you sure that you can trust the AEM version storage enough to rely on JCR versioning to recreate any intermediary states between the retained backups?
  • When you design such a complex process, you should definitely test the restore process regularly.
  • And finally: What are the costs of such a backup approach? Can you use the normal backup storage, or do you need a special solution which guarantees that the stored data cannot be tampered with?

You can see that the list of questions is long. I am not saying it is impossible, but it requires a lot of work and attention to detail.

In my project the deal breaker was the calculated storage cost (we would have required dedicated storage, as the normal backup storage did not provide the required guarantees for archival purposes). So we took a different approach and added a custom process which creates a PDF/A out of every activated page and stores it in a dedicated archival solution (assets are stored as-is). This adds upfront costs (a custom implementation), but is much cheaper in the long run. And on top of it, it does not need IT to access the old version of the homepage of January 23, 2019; instead the business users or legal can directly access the archive and fetch the respective PDF of the time they are interested in.

In AEM CS the situation is a bit different, because the majority of the questions above deal with “old AEM vs. everything else around it being current”, and many aspects are no longer relevant for customers; they are in Adobe’s domain instead. But I am not aware that Adobe ever planned to set up such a time machine which allows re-creating everything at a specific point in time (besides all the implications for security etc.), mostly because “everything” is a lot.

So, as a conclusion: using backups for content archival and compliance is not the best solution. It sounds easy at first, but it raises a lot of questions if you look into the details. The longer you need to retain these AEM backups, the more likely it is that inevitable changes in the surrounding environments make a proper restore harder or even impossible.

The new AEM CS feature in 2024 which I love most

Pretty much 4 years ago I joined the AEM as a Cloud Service engineering team, and since then I have been working on the platform level as a Site Reliability Engineer. I work on platform reliability and performance and help customers improve their applications in these aspects.

But that also means that many features released throughout the years are not that relevant for my work. There are a few, though, that matter a lot to me, because they allow me to help customers in really good and elegant ways.

In 2024 there was one which I like very much, and that’s the Traffic Rules feature (next to custom error pages and CDN cache purging as self-service). I like it because it lets you filter and transform traffic at scale where it can be handled best: at the CDN layer.

Before that feature was available, all traffic handling needed to happen at the dispatcher level. The combination of Apache httpd and dispatcher rules allowed you to perform all these operations. However, I consider that a bit problematic, because at that point the traffic has already hit the dispatcher instances. It was already in your datacenter, on your servers.

To mitigate that, many customers (both on-prem/AMS and AEM CS) purchased a WAF solution to handle specifically these cases. But now with the traffic rules every AEM CS customer gets a set of features they can use to handle traffic at the CDN level.

The documentation is quite extensive and contains relevant examples showcasing how you can block, rate-limit or transform traffic to your needs.

The most compelling reason why I rate this as my top feature of the year is the traffic transformation feature.

Part of my daily job is to help customers prepare their AEM CS instances to handle traffic spikes. Besides all the tuning on the backend, the biggest lever to improve this situation is to handle as many of these requests as possible at the CDN, because then they don’t hit the backend at all.

A constant problem in that situation are request parameters added by campaigns. You might know the “utm*”, “fbclid” or “gclid” query parameters on traffic coming to your site from clicks on Facebook or Google; and there are many more. Analytics tools need these parameters to attribute traffic to the right source and to measure the effectiveness of campaigns, but from a traffic management point of view these parameters are horrible: by default, CDNs and intermediate caches consider requests with query strings as non-cacheable. That means all these requests hit your publish instances, and the CDN and dispatcher caches are mostly useless for them.

It is possible to remove these request parameters at the dispatcher (using the /ignoreUrlParams configuration). But with the traffic transformation feature of AEM CS you can also remove them directly at the CDN, so that this traffic is served entirely from the CDN. That’s the best case, because these requests never make it to the origin, which also improves latency for end users.
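The dispatcher-side variant could look like this (a sketch following the documented /ignoreUrlParams syntax; here an “allow” entry means the parameter is ignored for caching purposes, so the page is cached despite the query string):

```
/ignoreUrlParams {
  /0001 { /glob "*" /type "deny" }       # by default, no parameter is ignored
  /0002 { /glob "utm_*" /type "allow" }  # ignore campaign parameters
  /0003 { /glob "fbclid" /type "allow" }
  /0004 { /glob "gclid" /type "allow" }
}
```

With this, a request for /page.html?utm_source=newsletter can be served from the dispatcher cache, while requests with parameters not listed here still bypass the cache.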

I am very happy about this feature, because the scaling calculation gets much easier when such campaign traffic is handled almost entirely by the CDN. And that’s the whole idea behind using a CDN: to handle the traffic spikes.

For this reason I recommend every AEM CS customer to check out the traffic rules to filter and transform traffic at the CDN level. They are included in every AEM CS offering, and you don’t need the extra WAF feature to use them.
Configure these rules to handle all your campaign traffic and increase the cache hit ratio. They are very powerful, and you can use them to make your application much more resilient.

Java interfaces, OSGI and package versions

TL;DR: Be cautious when implementing interfaces provided by libraries; you can run into problems when these libraries are updated. Check for the @ProviderType and @ConsumerType annotations on the Java interfaces you are using, to make sure that you don’t lock yourself to a specific version of a package, as sooner or later this will cause problems.

One of the principles of object-oriented programming is encapsulation, to hide implementation details. Java uses interfaces as a language feature to implement this principle.

OSGI uses a similar approach to implement services. An OSGI service offers its public API via a Java interface. This Java interface is exported and therefore visible to your Java code. And then you can use it the way it is taught in every AEM (and modern OSGI) class:

@Reference
UserNotificationService service;

With the magic of Declarative Services, a reference to an implementation of UserNotificationService is injected and you are ready to use it.

But since that interface is visible, and with the power of Java at hand, you can also implement it on your own:

public class MyUserNotificationService implements UserNotificationService {
...
}

Yes, this is possible and nothing prevents you from doing it. But …

Unlike plain object-oriented programming, OSGI has some higher aspirations. It focuses on modular software: dedicated bundles which can have independent lifecycles. You should be able to extend functionality in one bundle without all code in other bundles needing to be recompiled. So binary compatibility is important.

Assume that the framework you are using comes with a UserNotificationService which looks like this:

package org.framework.user;
public interface UserNotificationService {
  void notifyUserViaPopup (User user, NotificationContent notification);
}

Now you decide to implement this interface yourself (hey, it’s public and Java does not prevent you from doing it) and start using it in your codebase:

public class MyUserNotificationService implements UserNotificationService {
  public void notifyUserViaPopup (User user, NotificationContent notification) {
    ...
  }
}

All is working fine. But then the framework is updated, and now the UserNotificationService looks like this:

package org.framework.user;
public interface UserNotificationService { // version 1.1
  void notifyUserViaPopup (User user, NotificationContent notification);
  void notifyUserViaEMail (User user, NotificationContent notification);
}

Now you have a problem, because MyUserNotificationService is no longer compatible with UserNotificationService (version 1.1): MyUserNotificationService does not implement the method notifyUserViaEMail. Most likely your class can no longer be loaded, triggering interesting exceptions. You would need to adjust MyUserNotificationService and implement the missing method to make it run again, even if you never need the notifyUserViaEMail functionality.

So we have 2 problems with that approach:

  1. It will only be detected at runtime, which is too late.
  2. You should not be required to adapt your code to changes in someone else’s code, especially if it is just an extension of the API that you are not interested in at all.

OSGI has a solution for (1), but only some helpers for (2). Let’s first check the solution for (1).

Package versions and references

OSGI has the notion of “package versions”, and it is best practice to provide version numbers for API packages. That means you start with a version “1.0” and people start to use it (via service references). When you make a backwards-compatible change (like in the example above, adding a new method to the service interface), you increase the package version by a minor version to 1.1, and all existing users can still reference this service, even if their code was never compiled against version 1.1 of the UserNotificationService. If you make a backwards-incompatible change (e.g. removing a method from the service interface), you have to increase the major version to 2.0.
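With the bnd tooling, the package version is typically declared in a package-info.java file using the standard OSGI versioning annotation (a sketch, shown for the example package from above):

```java
// package-info.java of the exported API package; bnd picks this version
// up when generating the Export-Package manifest header.
@org.osgi.annotation.versioning.Version("1.1.0")
package org.framework.user;
```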

When you build your code with the bnd-maven-plugin (or the maven-bundle-plugin), the plugin will automatically calculate the import range based on these versions and store that information in target/classes/META-INF/MANIFEST.MF. If you just reference services, the import range is wide, like this:

org.framework.user;version="[1.0,2)"

which translates to: this bundle has a dependency on the package org.framework.user with a version equal to or higher than 1.0, but lower than (excluding) 2.0. That means a bundle with this import statement will resolve against package org.framework.user 1.1. If your OSGI environment only exports org.framework.user in version 2.0, your bundle will not resolve.

(Much more can be written in this aspect, and I simplified a lot here. But the above part is the important part when you are working with AEM as a consumer of the APIs provided to you.)

Package versions and implementing interfaces

The situation gets tricky when you implement exported interfaces, because that will lock you to a specific version of the package. If you implement MyUserNotificationService as listed above, the plugins will calculate the import range like this:

org.framework.user;version="[1.0,1.1)"

This basically locks you to the specific version 1.0 of the package. While it does not prevent changes to any implementation of the UserNotificationService in your framework libraries, it prevents any change to its API; and not only for the UserNotificationService, but for all other classes in the org.framework.user package as well.

But sometimes the framework requires you to implement interfaces, and these interfaces are “guaranteed” by its developers to not change. In that case the above behavior does not make sense, as a change to a different class in the same package would not break binary compatibility for these “you need to implement this interface” classes.

To handle this situation, OSGI introduced 2 Java annotations which can be added to such interfaces and which clearly express the intent of the developers. They also influence the import range calculation.

  • The @ProviderType annotation: This annotation expresses that the developer does not want you to implement this interface. It is purely meant to reference existing functionality (most likely provided by the same bundle as the API); if you implement such an interface anyway, the plugin will calculate a narrow import range.
  • The @ConsumerType annotation: This annotation shows the intention of the library developer that this interface can be implemented by other parties as well. Even if the library ships its own implementation of that service (so you can @Reference it), you are free to implement this interface yourself and register it as a service. If you implement an interface carrying this annotation, the import version range will be wide.
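Applied to the example above, the annotation (from the org.osgi.annotation.versioning package) would be used like this (a sketch of the framework-side code):

```java
package org.framework.user;

import org.osgi.annotation.versioning.ProviderType;

// @ProviderType: consumers should only @Reference this service, never
// implement it; the framework stays free to add methods in minor versions.
@ProviderType
public interface UserNotificationService {
  void notifyUserViaPopup (User user, NotificationContent notification);
}
```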

In the end your goal should be to not have a narrow import version range for any library. You should allow your friendly framework developers (and AEM) to extend existing interfaces without breaking binary compatibility. And that also means that you should not implement interfaces you are not supposed to implement.

Do not use the Stage environment in your content creation process!

Every now and then (and definitely more often than I ever expected) I come across a question about best practices for promoting content from the AEM as a Cloud Service Stage environment to Production. The AEM CS standard process does not allow that, and on further inquiry it turns out that these customers

  • create and validate the production content on the Stage environment
  • and when ready, promote that content to the Production environment and publish it.

This approach contradicts the CQ5 and AEM good practices (established basically forever ago), which say:

Production content is created only on the production environment. The Stage environment is used for code validation and performance testing.

These good practices are directly implemented in AEM CS, and for that reason it is not possible to promote content from Stage to the Production environment.

But there are other implications in AEM CS, when your content creation process takes place on the Stage environment:

  • If your Stage environment is an integral part of your content creation process, then your Stage environment must not have a lesser SLA than the Production environment. It actually is another production environment, which is not reflected in the SLAs of AEM CS.
  • If you use your Stage environment as part of the content creation process, which environment do you use for the final validation and performance testing? In the design of AEM CS this is the role of the Stage environment, because it is sized identical to Production.
  • In AEM CS the Production fullstack pipeline covers both the Stage and Production environments, but in a serial manner (first Stage, then Production, often with an extended approval period in between). That means your Stage environment gets updated while your Production environment is not yet, which could impact your content creation process.

For these reasons, do not spread your content creation process across 2 environments. If you have requirements which can only be satisfied with 2 dedicated and independent environments, please talk to Adobe product management early.

I am not saying that the product design is always 100% correct, or that you are wrong if you need 2 environments for content creation. But in most cases it was possible to fit the content creation process into the Production environment, especially with the addition of the preview publish. And if that’s still not a fit for your case, talk to Adobe early on, so we can learn about your requirements.

Do not use AEM as a proxy for backend calls

Since I started working with AEM CS customers, I have come across the architectural pattern a few times in which requests to a site are passed all the way through to the AEM instance (bypassing all caches); AEM then makes an outbound request to a backend system (for example a PIM system or another API service, sometimes public, sometimes reached via VPN), collects the result, and sends back the response.

This architectural pattern is problematic in a few ways:

  1. AEM handles requests with a threadpool, which has an upper limit on the number of requests it will handle concurrently (by default 200). That means that at any time the number of such in-flight backend requests is limited by the number of AEM instances. In AEM CS this number is variable (auto-scaling), but even in an auto-scaling world there is an upper limit.
  2. The most important factor for the number of such requests AEM can handle per second is the latency of the backend call. For example, if your backend system always responds in less than 100ms, one AEM instance can handle up to 2000 such proxy requests per second. If the latency is closer to 1 second, it is only up to 200 proxy requests per second. That can be enough, or it can be far too little.
  3. To achieve such throughput consistently, you need aggressive timeouts; if you configure your timeout at 2 seconds, your guaranteed throughput can only be up to 100 proxy requests per second.
  4. On top of all those proxy requests, your AEM instances also need to handle the other duties of AEM, most importantly rendering pages and delivering assets. That reduces the number of threads you can utilize for such backend calls.
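The numbers in points 2 and 3 follow from a simple capacity formula (an application of Little’s Law): sustained throughput is bounded by pool size divided by per-request latency. A minimal sketch, using the default pool size of 200 from point 1:

```typescript
// Back-of-the-envelope capacity math for a fixed thread pool:
// each backend call occupies one thread for its whole latency, so
// sustained throughput <= poolSize / latency (Little's Law).

function maxProxyThroughput(poolSize: number, backendLatencySeconds: number): number {
  return poolSize / backendLatencySeconds;
}

console.log(maxProxyThroughput(200, 0.1)); // ~2000 requests/second at 100 ms latency
console.log(maxProxyThroughput(200, 1));   // 200 requests/second at 1 s latency
console.log(maxProxyThroughput(200, 2));   // 100 requests/second with a 2 s timeout
```

Note that these are hard upper bounds per instance; any thread spent on page rendering or asset delivery lowers them further.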

The most common issue I have seen with this pattern is that in case of backend performance problems the threadpools of all AEM instances are exhausted within seconds, leading almost immediately to an outage of the AEM service. That means that a problem in the backend, or in the connection between AEM and the backend, takes down your page rendering abilities, leaving you with only what is cached at the CDN level.

The common recommendation we make in these cases is quite obvious: introduce more aggressive timeouts. But the actual solution to this problem is a different one:

Do not use AEM as a proxy.

This is a perfect example of a case where the client (browser) itself can do the integration. Instead of proxying (= tunneling) all backend traffic through AEM, the client can approach the backend service directly. The constraints AEM has (for example the number of concurrent requests) then no longer apply to the calls to the backend. Instead the backend is exposed directly to the end users, using whatever technology is suitable for that; typically it is exposed via an API gateway.
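A minimal sketch of what this looks like in the browser; the hostname, the endpoint, and the injectable `fetchImpl` parameter (added for testability) are all assumptions, not a prescribed API:

```typescript
// "Integration at the frontend layer": the browser fetches backend data
// directly via an API gateway, while the page itself is served by AEM/CDN.

type FetchLike = (url: string) => Promise<Response>;

async function loadProductData(
  productId: string,
  fetchImpl: FetchLike = fetch
): Promise<unknown> {
  // Goes straight to the API gateway -- AEM is not in this request path.
  const response = await fetchImpl(
    `https://api.example.com/products/${encodeURIComponent(productId)}`
  );
  if (!response.ok) {
    // A slow or failing backend only degrades this widget,
    // not page rendering and not a single AEM request thread.
    throw new Error(`Backend responded with ${response.status}`);
  }
  return response.json();
}
```

The key property: every one of these requests consumes resources on the gateway and the backend, and zero threads on AEM.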

If the backend gets slow, AEM is not affected. If AEM has issues, the backend is not directly impacted because of it. AEM does not even need to know that there is a backend at all. Both systems are entirely decoupled.

As you see, I much prefer this approach of “integration at the frontend layer” and exposing the backend to the end users over any kind of “AEM calls the backend systems”, mostly because such architectures are less complex and easier to debug and analyze. It should be your default and preferred approach whenever it is possible.

Disclaimer: Yes, there are cases where the application logic requires AEM to make backend calls; but even then it is questionable whether such calls need to happen synchronously, meaning that an AEM request must perform a backend call and consume its result before responding. If these calls can be made asynchronously, the whole problem vector I outlined above simply does not exist.
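To illustrate the asynchronous variant, here is a framework-agnostic sketch (the class and method names are my own invention, not an AEM API): a refresher fetches from the backend outside the request path, and request handling only ever reads the last known value.

```typescript
// Decouple the backend call from the request path: a scheduler calls
// refresh() periodically; request handlers call get() and never wait
// on backend latency.

type CacheEntry<T> = { value: T; fetchedAt: number };

class AsyncBackendCache<T> {
  private entry: CacheEntry<T> | null = null;

  constructor(private fetcher: () => Promise<T>) {}

  // Invoked by a scheduled job, NOT by request threads.
  async refresh(): Promise<void> {
    try {
      const value = await this.fetcher();
      this.entry = { value, fetchedAt: Date.now() };
    } catch {
      // Keep serving the stale value: a slow or broken backend
      // no longer blocks or fails any request.
    }
  }

  // Request path: a constant-time read, no backend involved.
  get(): T | null {
    return this.entry ? this.entry.value : null;
  }
}
```

With this shape, a backend outage degrades data freshness instead of consuming request threads.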

Note: In my opinion, hiding the hostnames of your backend system is not a good reason for such a backend integration either. Nor is “the service is only available from within our company network and AEM accesses it via VPN”. In both cases you can achieve the same with a publicly accessible API gateway, which is specifically designed to handle such use cases and all their security-relevant implications.

So, do not use AEM as a simple proxy!