AEM CS: Java 21 update

After a lengthy preparation period, the rollout of Java 21 for AEM as a Cloud Service will start this year. While the public documentation contains all the relevant information (and I don’t want to reiterate it here), I want to clarify a few things.

First, this is an update of the Java version used to run AEM as a Cloud Service, which can differ from the Java version used to build the application. Because newer Java versions are backwards compatible and can read binaries created by older versions, it is entirely possible to run the AEM CS instance on Java 21 while still building the application with Java 11. Of course this restricts you to the language features of Java 11 (for example, you cannot use records), but beyond that there is no negative impact at all.
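
To illustrate what that restriction means in practice, a small made-up example: the same value class written once as a Java record (records are final since Java 16, so a Java 21 build can use them) and once as the boilerplate you are limited to when compiling with Java 11.

// Possible with a build targeting Java 16 or newer (so fine once you build with Java 21):
public record Money(String currency, long amountInCents) { }

// Roughly the equivalent you have to write while the build still targets Java 11
// (shown as a second, package-private class here only for comparison):
final class LegacyMoney {
    private final String currency;
    private final long amountInCents;

    LegacyMoney(String currency, long amountInCents) {
        this.currency = currency;
        this.amountInCents = amountInCents;
    }

    String currency() { return currency; }
    long amountInCents() { return amountInCents; }
}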

This scenario is fully supported, but at some point you will need to move your build to a newer Java version, as newly added APIs might use Java features which are not available in Java 11. As a personal recommendation, I would suggest switching your build-time Java version to Java 21 as well.

This change of the runtime Java version should in most cases be completely invisible to you, at least as long as you don’t use or add 3rd-party libraries which need to support new Java versions explicitly. The most prominent examples in the AEM context are Groovy (often as part of the Groovy console) and the ASM library (a library for creating and modifying Java bytecode). If you deploy one of these into your AEM instance, make sure you update it to a version which supports Java 21.

AEM CS Backups, Restores and Archival

One recurring question I see in Adobe-internal communication channels goes like this: “For our customer X we need to know how long Adobe stores backups for our CS instances”.

The obvious answer to this is “7 days” (see the documentation) or “3 months” (for the offsite backup), because the backup is designed only to handle cases of data corruption in the repository. But in most cases there is a follow-up question: “But we need access to backup data for up to 5 years”. Then it’s clear that this question is not about backup at all, but rather about content archival and compliance. And that’s a totally different question.

TL;DR

When you need to retain content for compliance reasons, my colleagues are happy to discuss the details with you. But increasing the retention period for your backups is not a solution for it.

Compliance

So what does “content archival and compliance” mean in this situation? For regulatory and legal reasons, some industries are required to retain all public statements (including websites) for some time (normally 5-10 years). Of course the implementation of that is up to the company itself, and it seems quite easy to implement an approach which simply keeps backups around for these 5-10 years.

Some years back I spent some time on the drawing board to design a solution for an AEM on-prem customer; their requirement was to be able to prove what was displayed to customers on their website at any point within these 10 years. We initially also thought about keeping backups around for 10 years, but then we came up with these questions:

  • When the content is needed, the backup has to be restored to an environment which can host this AEM instance. Is such an environment (servers, virtual machines) available? How many of these environments would be needed, assuming that this instance has to run for some months (throughout the entire legal process which requires content from that time)?
  • Assuming that an 8-year-old backup must be restored, are the old virtual machine images with Red Hat Linux 7 (or whatever OS) still around? Is it okay from a compliance perspective to run these old and potentially unsupported OS versions, even in a secured network environment? Is the documentation which describes how to install all of that still around? Does your backup system still support a restore to such an old OS version?
  • How would you authenticate against such an old AEM version? Would you require your users to have their old passwords at hand (if you authenticate against AEM), or does your central identity management still support the interface this old AEM version uses for authentication?
  • As this is a web page, is it ensured that all external references embedded into the page are still available as well? Think about the JavaScript and CSS libraries which are often just pulled from their respective CDN servers.
  • How frequently must a backup be stored? Is it okay (and possible) to back up just the authoring instance every quarter, perform no cleanup (version cleanup, workflow purge, …) in that time and have all content changes versioned, so you can use the restore functionality to go back to the requested point in time? Or do you need to store a backup after each deployment, because each deployment can change the UI and introduce backwards-incompatible changes which leave the restored content broken? Would you need to archive the publish instance as well (where normally no versions are kept)? And do you trust the AEM version storage enough to rely on JCR versioning for recreating any intermediate state between those retained backups?
  • When you design such a complex process, you should definitely test the restore process regularly.
  • And finally: What are the costs of such a backup approach? Can you use the normal backup storage, or do you need a special solution which guarantees that the stored data cannot be tampered with?

You can see that the list of questions is long. I am not saying it is impossible, but it requires a lot of work and attention to detail.

In my project the deal breaker was the calculated storage cost (we would have needed dedicated storage, as the normal backup storage did not provide the guarantees required for archival purposes). So we decided to take a different approach and added a custom process which creates a PDF/A out of every activated page and stores it in the dedicated archival solution (assets are stored as-is). This adds upfront costs (a custom implementation), but is much cheaper in the long run. And on top of that, IT is not needed to access the old version of the homepage of January 23, 2019; instead, business users or legal can access the archive directly and fetch the PDF for the point in time they are interested in.
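
For illustration, a very rough sketch of how such a process can be hooked into (on-prem) AEM: react to page activations and hand the activated path over to the archival system. PdfArchiver is a hypothetical service standing in for the actual PDF/A rendering and the connection to the archive, which is where the real work happens.

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import org.osgi.service.event.Event;
import org.osgi.service.event.EventHandler;

import com.day.cq.replication.ReplicationAction;
import com.day.cq.replication.ReplicationActionType;

@Component(service = EventHandler.class,
           property = { "event.topics=com/day/cq/replication" })
public class ArchiveOnActivationListener implements EventHandler {

    @Reference
    private PdfArchiver pdfArchiver; // hypothetical archival service, not an AEM API

    @Override
    public void handleEvent(Event event) {
        ReplicationAction action = ReplicationAction.fromEvent(event);
        if (action != null && action.getType() == ReplicationActionType.ACTIVATE) {
            // A real implementation would offload this to a Sling job so the
            // event handler itself stays fast; omitted here for brevity.
            pdfArchiver.archive(action.getPath());
        }
    }
}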

In AEM CS the situation is a bit different, because the majority of the questions above deal with “old AEM vs. everything else around it being current”, and many aspects are no longer relevant for customers; they are in Adobe’s domain instead. But I am not aware of Adobe ever planning to set up such a time machine which allows everything to be re-created at a specific point in time (besides all the implications around security etc.), mostly because “everything” is a lot.

So, as a conclusion: using backups for content archival and compliance is not the best solution. It sounds easy at first, but it raises a lot of questions once you look into the details. The longer you need to retain these AEM backups, the more likely it is that inevitable changes in the surrounding environment make a proper restore harder or even impossible.

The new AEM CS feature in 2024 which I love most

Pretty much 4 years ago I joined the AEM as a Cloud Service engineering team, and since then I have been working at the platform level as a Site Reliability Engineer. I work on platform reliability and performance and help customers improve their applications in these areas.

But that also means that many features released over the years are not that relevant for my work. There are a few, though, that matter a lot to me, because they allow me to help customers in really good and elegant ways.

In 2024 there was one I like very much, and that’s the Traffic Rules feature (next to custom error pages and CDN cache purging as self-service). I like it because it lets you filter and transform traffic at scale where it can be handled best: at the CDN layer.

Before that feature was available, all traffic handling had to happen at the dispatcher level. The combination of Apache httpd and dispatcher rules allowed you to perform all of these operations, but I consider that a bit problematic, because at that point the traffic had already hit the dispatcher instances. It was already in your datacenter, on your servers.

To mitigate that, many customers (both on-prem/AMS and AEM CS) purchased a WAF solution specifically to handle these cases. But now, with traffic rules, every AEM CS customer gets a set of features they can use to handle traffic at the CDN level.

The documentation is quite extensive and contains relevant examples showcasing how you can block, rate-limit or transform traffic according to your needs.

The most compelling reason why I rate this as my top feature of the year is the traffic transformation part of it.

Part of my daily job is to help customers prepare their AEM CS instances to handle traffic spikes. Besides all the tuning on the backend, the biggest lever to improve this situation is to handle as many of these requests as possible at the CDN, because then they don’t hit the backend at all.

A constant problem in that situation are request parameters added by campaigns. You might know the “utm_*”, “fbclid” or “gclid” query parameters on traffic which reaches your site via a click on Facebook or Google, and there are many more. Analytics tools need these parameters to attribute traffic to the right source and to measure the effectiveness of campaigns, but from a traffic-management point of view these parameters are horrible, because by default CDNs and intermediate caches consider requests with query strings as non-cacheable. That means that all these requests hit your publish instances, and the CDN and dispatcher caches are mostly useless for them.

It’s possible to ignore these request parameters at the dispatcher (using the /ignoreUrlParams configuration). But with the traffic transformation feature of AEM CS you can also remove them directly at the CDN, so that this traffic is served entirely from the CDN. That’s the best case, because these requests never make it to the origin at all, which also improves latency for end users.

I am very happy about this feature, because the scaling calculation gets much easier when such campaign traffic is handled almost entirely by the CDN. And that’s the whole idea behind using a CDN: to handle the traffic spikes.

For this reason I recommend that every AEM CS customer checks out the traffic rules to filter and transform traffic at the CDN level. They are included in every AEM CS offering, and you don’t need the extra WAF feature to use them. Configure these rules to handle all your campaign traffic and increase the cache-hit ratio; they are very powerful and you can use them to make your application much more resilient.

Do not use AEM as a proxy for backend calls

Since I started working with AEM CS customers, I have come across the following architecture pattern a few times: requests made to a site are passed all the way through to the AEM instance (bypassing all caches), then AEM makes an outbound request to a backend system (for example a PIM system or another API service, sometimes public, sometimes via VPN), collects the result and sends back the response.

This architectural pattern is problematic in a few ways:

  1. AEM handles requests with a thread pool, which has an upper limit on the number of requests it will process concurrently (by default 200). That means that at any point in time the number of such backend requests is limited by the number of AEM instances. In AEM CS this number is variable (auto-scaling), but even in an auto-scaling world there is an upper limit.
  2. The most important factor for the number of such requests AEM can handle per second is the latency of the backend call. For example, if your backend system always responds in less than 100ms, a single AEM instance can handle up to 2000 such proxy requests per second. If the latency is closer to 1 second, it’s only up to 200 proxy requests per second (see the small calculation after this list). That can be enough, or it can be far too little.
  3. To achieve such a throughput consistently, you need aggressive timeouts; if you configure your timeout at 2 seconds, your guaranteed throughput is only up to 100 proxy requests per second.
  4. And next to all those proxy requests, your AEM instances also need to handle their other duties, most importantly rendering pages and delivering assets. That further reduces the number of threads you can use for such backend calls.
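
To make the arithmetic in points 2 and 3 explicit, here is a back-of-the-envelope sketch (per AEM instance, using the default of 200 worker threads mentioned above):

public class ProxyThroughputEstimate {

    // Little's law: sustainable throughput is bounded by
    // (number of worker threads) / (average backend latency).
    static double maxRequestsPerSecond(int workerThreads, double backendLatencySeconds) {
        return workerThreads / backendLatencySeconds;
    }

    public static void main(String[] args) {
        int threads = 200; // default thread pool size mentioned above
        System.out.println(maxRequestsPerSecond(threads, 0.1)); // 100ms latency -> 2000 req/s
        System.out.println(maxRequestsPerSecond(threads, 1.0)); // 1s latency    ->  200 req/s
        System.out.println(maxRequestsPerSecond(threads, 2.0)); // 2s timeout    ->  100 req/s guaranteed
    }
}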

The most common issue I have seen with this pattern is that, in case of backend performance problems, the thread pools of all AEM instances are consumed within seconds, leading almost immediately to an outage of the AEM service. That means that a problem on the backend, or on the connection between AEM and the backend, takes down your page-rendering ability, leaving you with only what is cached at the CDN.

The common recommendation we make in these cases is quite obvious: introduce more aggressive timeouts. But the actual solution to this problem is a different one:

Do not use AEM as a proxy.

This is a perfect example of a case where the client (browser) itself can do the integration. Instead of proxying (tunneling) all backend traffic through AEM, the client can approach the backend service directly. Then the constraints AEM has (for example the number of concurrent requests) no longer apply to the calls to the backend. Instead the backend is exposed directly to the end users, using whatever technology is suitable for that; typically it is exposed via an API gateway.

If the backend gets slow, AEM is not affected. If AEM has issues, the backend is not directly impacted either. AEM does not even need to know that there is a backend at all. Both systems are entirely decoupled.

As you can see, I much prefer this approach of “integration at the frontend layer”, exposing the backend to the end users, over any kind of “AEM calls the backend systems”, mostly because such architectures are less complex and easier to debug and analyze. This should be your default and preferred approach whenever it is possible.

Disclaimer: yes, there are cases where the application logic requires AEM to make backend calls; but even then it is questionable whether those calls need to happen synchronously within requests, meaning that an incoming AEM request has to perform a backend call and consume its result. If these calls can be done asynchronously, the whole problem vector I outlined above simply does not exist.
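
A minimal sketch of what “asynchronously” can look like on the AEM side, assuming the data can be fetched periodically instead of per request. BackendClient and ProductCatalog are hypothetical types standing in for your external API client and its result, and the cron expression is just an example:

import java.util.concurrent.atomic.AtomicReference;

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(service = Runnable.class,
           property = { "scheduler.expression=0 0/15 * * * ?" }) // example: every 15 minutes
public class BackendCatalogSync implements Runnable {

    @Reference
    private BackendClient backendClient; // hypothetical client for the external API

    private final AtomicReference<ProductCatalog> lastSnapshot = new AtomicReference<>();

    @Override
    public void run() {
        // Executed by the Sling scheduler in a background thread, so backend latency
        // or outages never block request-handling threads.
        lastSnapshot.set(backendClient.fetchCatalog());
    }

    // Rendering code (e.g. a Sling Model) reads the last successful snapshot instead of
    // calling the backend itself; error handling is omitted in this sketch.
    public ProductCatalog getLastSnapshot() {
        return lastSnapshot.get();
    }
}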

Note: in my opinion, hiding the hostnames of your backend system is not a good reason for such a backend integration either. Neither is “the service is only available from within our company network and AEM accesses it via VPN”. In both cases you can achieve the same with a publicly accessible API gateway, which is specifically designed to handle such use cases and their security implications.

So, do not use AEM as a simple proxy!

My view on manual cache flushing

I read the following statement by Samuel Fawaz on LinkedIn regarding the recent announcement of the self-service feature for obtaining the API key for CDN purging in AEM as a Cloud Service:

[…] “Sometimes the CDN cache is just messed up and you want to clean out everything. Now you can.”

I fully agree that self-service for this feature was overdue. But I always wonder why an explicit cache flush (both for the CDN and the dispatcher) should be necessary at all.

The caching rules are very simple: the AEM as a Cloud Service CDN bases everything on the TTL (time-to-live) information sent by AEM or set in the dispatcher configuration. The caching rules of the dispatcher are equally simple and should be well understood (I find that this post on the TechRevel blog covers the topic of dispatcher cache flushing quite well).
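
In other words, the CDN TTL is simply whatever your responses declare. As a reminder of that mechanism, a minimal sketch of a custom servlet setting these headers explicitly; in practice you will more often configure them centrally in the dispatcher vhost, and the resource type and the values here are made up:

import java.io.IOException;

import javax.servlet.Servlet;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.service.component.annotations.Component;

@Component(service = Servlet.class,
           property = {
               "sling.servlet.resourceTypes=myproject/components/pricelist", // hypothetical
               "sling.servlet.extensions=json"
           })
public class PriceListServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws IOException {
        // The CDN (and the browser) derive their TTL from these standard caching headers;
        // the values are placeholders, not a recommendation.
        response.setHeader("Cache-Control", "max-age=300");      // browsers: 5 minutes
        response.setHeader("Surrogate-Control", "max-age=3600"); // CDN: 1 hour
        response.setContentType("application/json");
        response.getWriter().write("{}"); // payload omitted
    }
}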

In my opinion it should be possible to build a model which lets you estimate how long it takes for a page update to become visible to all users via the CDN. Such a model also lets you reason about more complex situations (especially when content is pulled from multiple pages/areas for rendering) and understand how and when content changes become visible to end users.

But when I look at the customer requests for cache flushes (CDN and dispatcher), I think that in most cases there is no clear understanding of what actually happened; most often it’s just that the content looks as expected on authoring and was activated properly, but the change does not show up the same way on publish. The solution is often to request a cache flush (or trigger it yourself) and hope for the best. And very often this fixes the problem, and the most up-to-date content is delivered again.

But is there an understanding of why the caches were not updated properly? Honestly, I often doubt it. Just as the infamous “Windows restart” can fix annoying, suddenly appearing problems with your computer, flushing caches seems to be one of the first steps for fixing content problems. The issue goes away, we shrug and go on with our work.

But unlike the Windows case, the situation here is different, because you have the dispatcher configuration in your git repository, and you know the rules of caching. You have everything you need to understand the problem and to keep it from happening again.

Whenever the authoring users come to you with the request “content is not showing up, please flush the cache”, you should treat the situation as a bug. Because that’s what it is: the system is not working as expected. Apply the workaround (do the flush), but afterwards invest the time into a root-cause analysis (RCA) of why it happened, and understand and adjust the caching rules. Very often these cases are well reproducible.

In his LinkedIn post Samuel writes “Sometimes the CDN cache is just messed up”, and I don’t think that is true. It is not a random event you cannot influence at all; on the contrary, it is an event defined by your caching configuration, an event which you can control and prevent. You just need to understand how. And I think that this step of understanding and then fixing is very often missing. Then the next request from your authoring users for a cache flush is inevitable, and another cache flush is executed.

In the end, flushing caches comes at the price of increased latency for end users until the cache is populated again. And that’s a situation we should avoid as much as we can.

So as a conclusion:

  • An explicitly requested cache clear is a bug because it means that something is not working as expected.
  • And like every bug, it should be understood and fixed, so that you no longer need to perform the workaround.

Adopting AEM as a Cloud Service: Shifting from Code-Centric Approaches

The first CQ5 version I worked with was CQ 5.2.0 in late 2009, and since then a lot has changed. I could list many technical changes and details, but that’s not the most interesting part. Instead, I want to propose this hypothesis as the most important change:

CQ5 was a framework which you had to customize to get value out of it. Starting with AEM 6.x, more and more out-of-the-box features were added which could be used directly. In AEM as a Cloud Service, most new features are directly usable, without requiring (or even allowing) customization.

And as a corollary: the older your code base, the more customizations it contains, and the harder the adoption of new features becomes.

As an SRE for AEM as a Cloud Service I work with many customers who migrated their application over from an AEM 6.x version. While the “Best Practices Analyzer” is a great help in getting your application ported to AEM CS, it is just that: it helps you migrate your customizations, the (sometimes) vast amount of overlays for the authoring UI, backend integrations, complex business and rendering logic, JSPs, et cetera. Very often this code is based on the AEM framework only and could technically still run on CQ 5.6.1, because it works with nodes, resources, assets and pages as its only building blocks.

While this was the most straightforward way in the times of CQ5, it becomes more and more of a problem in later versions. With the introduction of Content Fragments, Experience Fragments, Core Components, the Universal Editor, Edge Delivery Services and others, many new features were added which often do not fit into the self-grown application structures. These product features are promoted and demoed, and it’s understandable that business users want to use them. But adopting these new features often requires large refactorings, proper planning and a budget; nothing you do in a single 2-week sprint.

This situation also has an impact on the developers themselves. While customization through code was the standard procedure in CQ5, there are often other ways available in AEM CS. But when I read through the AEM forums and new AEM blog posts, I still see a large focus on coding: custom servlets, Sling Models, filters, whatever; often in the same old CQ5 style we had to use 10 years ago, because there was nothing else. That approach still works, but it will lead you straight into customization hell again, and often it violates the practices recommended for AEM CS.

That means:

  • If you want to start an AEM CS project in 2024, please don’t follow the same old approach.
  • Make sure that you understand the new features introduced in the last 10 years, and how you can mix and match them to implement the requirements.
  • Opening the IDE and starting to code should be your last resort.

It also makes sense to talk to Adobe about the requirements you need to implement; features requested by many customers are often prioritized and implemented with customer involvement, which is much easier to do in AEM CS than before.

AEM CS & Mongo exceptions

If you are an avid log checker on your AEM CS environments you might have come across messages like this in your authoring logs:

02.04.2024 13:37:42:1234 INFO [cluster-ClusterId{value='6628de4fc6c9efa', description='MongoConnection for Oak DocumentMK'}-cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net:27017] org.mongodb.driver.cluster Exception in monitor thread while connecting to server cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net:27017 com.mongodb.MongoSocketException: cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net 
at com.mongodb.ServerAddress.getSocketAddresses(ServerAddress.java:211) [org.mongodb.mongo-java-driver:3.12.7]
at com.mongodb.internal.connection.SocketStream.initializeSocket(SocketStream.java:75) [org.mongodb.mongo-java-driver:3.12.7]
...
Caused by: java.net.UnknownHostException: cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net

And you might wonder what is going on. I get this question every now and then, often with the assumption that this is something problematic, because we have all learned that stack traces normally indicate problems. And at first sight this does look like a problem: a specific hostname cannot be resolved. Is there a DNS problem in AEM CS?

Actually, this message does not indicate any problem. The reason is the way MongoDB implements scaling operations: if you scale the MongoDB cluster up or down, this does not happen in place; instead you get a new cluster of the new size (with the same content, of course), and this new cluster comes with a new hostname.

So in this situation a scaling operation took place; AEM CS connected to the new cluster and then lost the connection to the old cluster, because the old cluster was stopped and its DNS entry removed. Which is of course expected, and that is also why this is logged at level INFO and not as an ERROR.

Unfortunately this log message is created by the MongoDB driver itself, so it cannot be changed at the Oak level by removing the stack trace or rewording the message. For that reason you will continue to see it in the AEM CS logs until a new, improved MongoDB driver changes that.

Thoughts on performance testing on AEM CS

Performance is an interesting topic on its own, and I have already written a bit about it in this blog (see the overview). I have not yet written about performance testing in the context of AEM CS. It’s not fundamentally different, but there are some specifics which you should be aware of.

  • Perform your performance tests on the stage environment. The stage environment is kept at the same sizing as production, so it should show the same behavior as your PROD environment, provided you have the same content and your test is realistic.
  • Use a warmup phase. As the stage environment is normally scaled down to the minimum (because there is no regular traffic), it can take a while until it has scaled up (automatically) to the same number of instances your PROD normally operates with. That means your test should have a proper warmup period during which you increase the traffic to the normal 100% production level. This warmup phase should take at least 30 minutes.
  • I think any test should run for at least 60-90 minutes (including warmup); even if you see early on that the result is not what you expected, there is often something to learn from such incorrect/faulty situations. I had a case where a customer kept terminating the test after about 20-25 minutes, claiming that something on the server side was not working as they expected. Unfortunately the situation had not yet settled at that point, so I was not able to get any useful information out of the system.
  • AEM CS comes with a CDN bundled with every environment, including stage. That also means your performance test should contain all requests which you expect to be delivered from the CDN. This is important because it shows whether your caching works as intended, and only then can you assess the impact of cache misses (when files expire on the CDN) on the overall performance.
  • While you are at it, you can run a stage pipeline during the performance test and deploy new code. You should not see any significant change in performance during that time.
  • Oh yes, also do some content activations during that time. That makes your test much more realistic and can also reveal potential performance problems when updating content (e.g. because you constantly invalidate the complete dispatcher cache).
  • You should focus on a large content set when you do the performance test. If you only test a handful of pages/assets/files, you are mostly testing caches (at all levels).
  • “Campaign traffic” is rarely tested. This is traffic which has query strings attached (e.g. “utm_source”, “gclid” and the like) to support traffic attribution. These parameters are ignored during rendering, but they often bypass all caching layers and hit AEM. If your regular performance test only runs without these parameters and your marketing department then starts a Facebook campaign, the traffic from that campaign looks very different, and the results of your performance tests are no longer valid.

A few words of precaution:

  • A performance test can look like a DoS attack, and your requests can get blocked for that reason, especially if they originate from a single source IP. You should therefore distribute your load injectors and use multiple source IP addresses. In case you still get blocked, please contact support so we can adapt accordingly.
  • AEM CS uses an affinity cookie to ensure that requests of a user agent are handled by a specific backend instance. If you use the same affinity cookie throughout your entire performance test, you only test a single backend instance; that effectively disables any load balancing and renders the test results unusable. Make sure you design your performance tests with that in mind (see the sketch below).
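
Most load-test tools let you control cookie handling per virtual user; if you script something yourself, the important part is simply that every simulated user gets its own cookie store. A minimal sketch with the plain JDK HTTP client (the URL is made up):

import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class VirtualUser {

    // One HttpClient (and therefore one cookie store) per simulated user, so every
    // virtual user keeps its own affinity cookie instead of the whole test sticking
    // to a single backend instance.
    private final HttpClient client = HttpClient.newBuilder()
            .cookieHandler(new CookieManager())
            .build();

    int fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    public static void main(String[] args) throws Exception {
        // made-up stage URL, just for the sketch
        System.out.println(new VirtualUser().fetch("https://publish-p12345-e67890.adobeaemcloud.com/"));
    }
}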

In general I much prefer helping you during the performance-testing phase over handling escalations because of bad performance and potential outages later on. I hope you feel the same way.

If you have curl, every problem looks like a request

If you work in IT (or as a craftsperson) you probably know the saying “When you have a hammer, every problem looks like a nail”. It describes the tendency of people who have a tool which reliably solves one specific problem to apply that tool to every other problem as well, even if it does not fit at all.

Sometimes I see this pattern in AEM as well, not with a hammer but with “curl”. Curl is a command-line HTTP client, and it’s quite easy to fire a request against AEM and do something with the output. Every AEM developer should be familiar with it, also because it’s a great tool for automating things. And when you talk about “automating AEM”, the first thing people often come up with is “curl”…

And that is where the problem starts: not every problem can be automated with curl. Take, for example, a periodic data export from AEM. The immediate reaction of most developers (forgive me for generalizing, but I have seen this pattern too often!) is to write a servlet which pulls all the data together and creates a CSV list, and then to use curl to request this servlet every day or week.

Works great, doesn’t it? Good, mark that task as done, next!

Wait a second, on prod it takes 2 minutes to create that list. Well, not a problem, right? Until it takes 20 minutes, because the number of assets keeps growing. And until you move to AEM CS, where requests time out after 60 seconds and your curl call is terminated with a status code 503.

So what is the problem? It is not the 60-second timeout, and it’s also not the constantly growing number of assets. It’s the fact that this is a batch operation, and you are using a communication pattern (request/response) which is not well suited for batch operations. It’s the fact that you start with curl in mind (a tool built for the request/response pattern) and therefore build the implementation around that pattern. You have curl, so every problem is solved with a request.

What are the limits of this request/response pattern? The runtime definitely is one, and actually for three reasons:

  • The request timeout in AEM CS (or at basically any other load balancer) is set for security reasons and to prevent misuse. Of course the limit of 60 seconds in AEM CS is a bit arbitrary, but personally I would not wait 60 seconds for a web page to start rendering, so it’s as good as any higher number.
  • Another limit is determined by the availability of the backend system which actually processes the request. In a highly available, auto-scaling environment, systems start and stop in an automated fashion, managed by a control plane which operates on a set of rules. These rules can enforce that any (AEM) instance is shut down at most 10 minutes after it has stopped receiving new requests. For a request which consistently takes 30+ minutes, that means it might be terminated without ever finishing successfully, and it’s unclear whether your curl call would even notice (especially if you are streaming the results).
  • (And technically the whole network connection needs to stay open for that long, and AEM CS is just one factor in that chain. The internet is not always stable either; you can experience network hiccups at any point in time. Normally this is well hidden by retrying failed requests, which is not an option here, because a retry would not solve the problem at all.)

In short: if your task can take long (say 60+ seconds), then a single request is not necessarily the best way to implement it.

So what options do you have? Well, the following approach also works in AEM CS:

  1. Use a request to create and initiate your task (let’s call it a “job”);
  2. Then poll the system until the job is completed and fetch the result.

This is an asynchronous pattern, and it scales much better with respect to the amount of processing you can do within the job.

Of course you can no longer use a single curl command; you now need to write a small program to execute this logic (please don’t write it as a shell script!). On the AEM side you can use either Sling jobs or AEM workflows to perform the operation.
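
A minimal sketch of that pattern with Sling jobs; the servlet path, the job topic and the payload are made up, and in a real project the two classes would of course live in separate files:

import java.io.IOException;
import java.util.Map;

import javax.servlet.Servlet;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingAllMethodsServlet;
import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.JobManager;
import org.apache.sling.event.jobs.consumer.JobConsumer;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

// 1. A servlet which only creates the job and returns immediately.
@Component(service = Servlet.class,
           property = { "sling.servlet.paths=/bin/myproject/startExport" }) // hypothetical path
class StartExportServlet extends SlingAllMethodsServlet {

    @Reference
    private JobManager jobManager;

    @Override
    protected void doPost(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws IOException {
        Job job = jobManager.addJob("myproject/jobs/csvexport",        // hypothetical topic
                Map.of("rootPath", "/content/dam/myproject"));         // hypothetical payload
        response.setContentType("application/json");
        // The client keeps this id and polls a (not shown) status endpoint with it.
        response.getWriter().write("{\"jobId\":\"" + job.getId() + "\"}");
    }
}

// 2. The consumer which does the actual, potentially long-running work in the background.
@Component(service = JobConsumer.class,
           property = { JobConsumer.PROPERTY_TOPICS + "=myproject/jobs/csvexport" })
class CsvExportJobConsumer implements JobConsumer {

    @Override
    public JobResult process(Job job) {
        String rootPath = job.getProperty("rootPath", String.class);
        // ... iterate the content below rootPath, build the CSV and store it somewhere
        // the polling request can pick it up (for example below /var) ...
        return JobResult.OK; // or JobResult.FAILED if anything went wrong
    }
}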

This avoids the 60-second restriction, and it can handle AEM restarts transparently, at least on the author side. And you have the huge benefit that you can collect all errors during the runtime of the job and decide afterwards whether the execution succeeded or failed (something you cannot do once an HTTP response is already being streamed).

So when you have long-running operations, check whether they really need to run within a request. In many cases they don’t, and then please switch to an asynchronous pattern. That’s something you can do even before the situation starts to become a problem.

Identify repository access

Performance tuning in AEM is typically a tough job. The most obvious and widely known aspect is the tuning of JCR queries, but that’s about it; if your code does not run any JCR query and is still slow, it gets hard. For requests my standard approach is to use “Recent Requests” and identify slow components, but that’s it. And then there are thread dumps, but they rarely help here. There is no standard way to diagnose further without relying on gut feeling and luck.

When I had to optimize a request last year, I thought about this problem again. And I asked myself the question:
Whenever I look at this request in the thread dumps, I see the code accessing the repository. Why is that? Is the repository slow, or is the code just accessing the repository very frequently?

The available tools cannot answer this question, so I had to write something myself which can. In the end I contributed it to the Sling codebase with SLING-11654.

The result is an additional logger (“org.apache.sling.jcr.resource.AccessLogger.operation” at log level TRACE) which you can enable and which then logs every single (Sling) repository access, including the operation, the path and the full stack trace. That is a huge amount of data, but it answered my question quite thoroughly.

  • The repository itself is very fast, because a request (taking 500ms in my local setup) performs 10,000 repository accesses. So the problem is rather the total number of repository accesses.
  • Looking at the list of accessed resources, it became very obvious that there is a huge amount of redundant access. For example, these are the top 10 accessed paths while rendering a simple WKND page (/content/wknd/language-masters/en/adventures/beervana-portland):
    • 1017 /conf/wknd/settings/wcm/templates/adventure-page-template/structure
    • 263 /
    • 237 /conf/wknd/settings/wcm/templates
    • 237 /conf/wknd/settings/wcm
    • 227 /content
    • 204 /content/wknd/language-masters/en
    • 199 /content/wknd
    • 194 /content/wknd/language-masters/en/adventures/beervana-portland/jcr:content
    • 192 /content/wknd/jcr:content
    • 186 /conf/wknd/settings

With that logger I was able to identify access patterns and map them to code. Suddenly you see a much bigger picture, and you can spot a lot of redundant repository access.
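
To make that concrete, here is a contrived sketch of the pattern such a log typically reveals (the path is made up, and this is not the actual cause behind the WKND numbers above): the same resource is resolved over and over inside a loop, while resolving it once outside the loop removes all of those repeated accesses without changing the behavior.

import java.util.List;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

public class RedundantAccessExample {

    // Before: the shared resource is resolved again for every single item, which
    // shows up as many identical entries in the access log.
    static void renderBefore(ResourceResolver resolver, List<String> itemPaths) {
        for (String path : itemPaths) {
            Resource config = resolver.getResource("/conf/myproject/settings/some-config");
            Resource item = resolver.getResource(path);
            // ... render item using config ...
        }
    }

    // After: the shared resource is resolved exactly once and reused.
    static void renderAfter(ResourceResolver resolver, List<String> itemPaths) {
        Resource config = resolver.getResource("/conf/myproject/settings/some-config");
        for (String path : itemPaths) {
            Resource item = resolver.getResource(path);
            // ... render item using config ...
        }
    }
}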

With that help I identified the bottleneck in the code, and the post “Sling Model performance” was a direct result of this finding. Another result was the topic of my talk at adaptTo() 2023; check out the recording for more numbers, details and learnings.

Through these experiences I made an important observation: you can use the number of repository accesses as a proxy metric for performance. The more repository accesses you make, the slower your application will be. So you don’t need to rely as much on performance tests anymore (although they definitely have their value); you can validate code changes by counting the number of repository accesses they perform. Fewer repository accesses are always more performant, no matter the environmental conditions.

And with an additional logger (“org.apache.sling.jcr.resource.AccessLogger.statistics” on TRACE) you can get just the raw numbers without the details, so you can easily validate any improvement.

Equipped with that knowledge, you should be able to investigate the performance of your application on your local machine. Looking forward to the results 🙂

(This is currently only available in AEM CS and the AEM CS SDK; I will see if I can get it into an upcoming AEM 6.5 service pack.)