The new AEM CS feature in 2024 which I love most

Almost exactly 4 years ago I joined the AEM as a Cloud Service engineering team, and since then I have been working on the platform level as a Site Reliability Engineer. I work on platform reliability and performance and help customers improve their applications in these areas.

But that also means that many features released over the years are not that relevant for my work. There are a few, though, that matter a lot to me, because they allow me to help customers in really good and elegant ways.

In 2024 there was one which I like very much, and that’s the Traffic Rules feature (next to the custom error page and CDN cache purging as self-service). I like it because it lets you filter and transform traffic at scale where it can be handled best: at the CDN layer.

Before that feature was available, all traffic handling needed to happen at the dispatcher level. The combination of Apache httpd and dispatcher rules allowed you to perform all these operations. However, I consider that a bit problematic, because at that point the traffic has already hit the dispatcher instances. It is already in your datacenter, on your servers.

To mitigate that, many customers (on-prem/AMS as well as AEM CS) purchased a WAF solution to handle specifically these cases. But now with the traffic rules every AEM CS customer gets a new set of features which they can use to handle traffic on the CDN level.

The documentation is quite extensive and contains relevant examples, showcasing how you can block, rate-limit or transform traffic to your needs.

The most compelling reason I rate this as my top feature this year is really the traffic transformation feature.

A part of my daily job is to help customers prepare their AEM CS instances to handle their traffic spikes. Besides all the tuning on the backend, the biggest lever to improve this situation is to handle all these requests at the CDN, because then they do not hit the backend at all.

A constant problem in that situation are request parameters which are added by campaigns. You might know the “utm*”, “fbclid” or “gclid” query parameters, which are added when visitors reach your site through a link clicked on Facebook or Google. And there are many more. Analytics tools need these parameters to attribute traffic to the right source and to measure the effectiveness of campaigns, but from a traffic management point of view these parameters are horrible, because by default all CDNs and intermediate caches consider requests with query strings as non-cacheable. That means that all these requests hit your publish instances, and the CDN and the dispatcher caches are mostly useless for them.

It’s possible to remove these request parameters on the dispatcher (using the /IgnoreUrlParams configuration). But with the traffic transformation feature of AEM CS you can remove them also directly on the CDN, so that this traffic is then served entirely from the CDN. That’s the best case situation, because then these requests never make it to origin, which improves latency for end users.
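To illustrate why removing these parameters helps the cache hit ratio, here is a minimal Java sketch of the normalization itself. It is not the CDN rule syntax (that is covered in the traffic rules documentation); the class name `CampaignParamStripper`, the `utm_` prefix and the short parameter list are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

final class CampaignParamStripper {

    // Illustrative list only; real campaigns use many more parameters.
    private static final Set<String> EXACT_NAMES = Set.of("fbclid", "gclid");
    private static final String PREFIX = "utm_";

    /** Returns the path plus query string with campaign parameters removed. */
    static String strip(String pathAndQuery) {
        int q = pathAndQuery.indexOf('?');
        if (q < 0) {
            return pathAndQuery; // no query string, nothing to do
        }
        String path = pathAndQuery.substring(0, q);
        String query = pathAndQuery.substring(q + 1);
        String kept = Arrays.stream(query.split("&"))
                .filter(pair -> !pair.isEmpty())
                .filter(pair -> !isCampaignParam(pair))
                .collect(Collectors.joining("&"));
        // Drop the "?" entirely when no parameter survives.
        return kept.isEmpty() ? path : path + "?" + kept;
    }

    private static boolean isCampaignParam(String pair) {
        int eq = pair.indexOf('=');
        String name = eq < 0 ? pair : pair.substring(0, eq);
        return name.startsWith(PREFIX) || EXACT_NAMES.contains(name);
    }

    public static void main(String[] args) {
        System.out.println(strip("/products.html?utm_source=news&gclid=abc"));
        System.out.println(strip("/search.html?q=shoes&fbclid=xyz"));
    }
}
```

With the parameters stripped, `/products.html?utm_source=news&gclid=abc` and `/products.html?fbclid=xyz` both collapse to the same URL `/products.html`, so a cache only needs to hold one entry for them.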

I am very happy about this feature, because with it the scaling calculation gets much easier, when such campaign traffic is handled almost entirely by the CDN. And that’s the whole idea behind using a CDN: To handle the traffic spikes.

For this reason I recommend every AEM CS customer to check out the traffic rules to filter and transform traffic at the CDN level. It is included in every AEM CS offering and you don’t need the extra WAF feature to use it.
Configure these rules to handle all your campaign traffic and increase the cache hit ratio. It’s very powerful and you can use it to make your application much more resilient.

Do not use the Stage environment in your content creation process!

Every now and then (and definitely more often than I ever expected) I come across a question about best practices for promoting content from the AEM as a Cloud Service Stage environment to Production. The AEM CS standard process does not allow that, and on further inquiry it turns out that these customers

create and validate the production content on the Stage environment
and when ready, promote that content to the Production environment and publish it.

This approach contradicts quite a bit the CQ5 and AEM good practices (since basically forever!), which say:

Production content is created only on the production environment. The Stage environment is used for code validation and performance testing.

These good practices are directly implemented in AEM CS, and for that reason it is not possible to promote content from Stage to the Production environment.

But there are other implications in AEM CS, when your content creation process takes place on the Stage environment:

If your Stage environment is an integral part of your content creation process, then your Stage environment must not have any lesser SLA than the Production environment. It actually is another production environment. Which is not reflected in the SLAs in AEM CS.
If you use your Stage environment as part of the content creation process, which environment do you use for the final validation and performance testing? In the design of AEM CS this is the role of the Stage environment, because it is sized identical to Production.
In AEM CS the Production Fullstack pipeline covers both the Stage and Production environments, but in a serial manner (first Stage and then Production, often with an extended period of time for the approval step in between). That means that you can end up with an updated Stage environment but a not yet updated Production environment, which could impact your content creation process.

For these reasons, do not spread your content creation process across 2 environments. If you have requirements which can only be satisfied with 2 dedicated and independent environments, please talk to Adobe product management early.

I am not saying that the product design is always 100% correct and that you are wrong if you need 2 environments for content creation. But in most cases it was possible to fit the content creation process into the Production environment, especially with the addition of the preview publish. And if that’s still not a fit for your case, talk to Adobe early on, so we can learn about your requirements.

Do not use AEM as a proxy for backend calls

Since I started working with AEM CS customers, I have come across the following architecture pattern a few times: requests made to a site are passed all the way through to the AEM instance (bypassing all caches), and then AEM does an outbound request to a backend system (for example a PIM system or another API service, sometimes public, sometimes via VPN), collects the result and sends back the response.

This architectural pattern is problematic in a few ways:

AEM handles requests with a threadpool, which has an upper limit of requests it will handle concurrently (by default 200). That means that at any time the number of such backend requests is limited by the number of AEM instances. In AEM CS this number is variable (auto-scaling), but even in an auto-scaling world there is an upper limit.
The most important factor in the number of such requests AEM can handle per second is the latency of the backend system call. For example if your backend system responds always in less than 100ms, your AEM can handle up to 2000 of such proxy requests per second. If the latency is more likely 1 second, it’s only up to 200 proxy requests per second. This can be enough, this can be way too small.
To achieve such a throughput consistently, you need to have aggressive timeouts; if you configure your timeouts with 2 seconds, your guaranteed throughput can only be up to 100 proxy requests per second.
And next to all those proxy requests your AEM instances also need to handle the other duties of AEM, most importantly rendering pages and delivering assets. That will reduce the number of threads you can utilize for such backend calls.
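The numbers in this list follow from simple arithmetic: each thread can complete 1000 / latency (in ms) calls per second, so the upper bound is threads × 1000 / latency. A minimal Java sketch of that calculation, assuming a single instance whose 200 threads are all available for proxy calls (which, as the last point says, they are not); the class and method names are made up for this example:

```java
final class ProxyThroughput {

    /**
     * Upper bound of proxy requests per second for one instance:
     * every thread completes (1000 / latencyMillis) backend calls per second.
     */
    static double maxRequestsPerSecond(int threads, long latencyMillis) {
        if (threads <= 0 || latencyMillis <= 0) {
            throw new IllegalArgumentException("threads and latency must be positive");
        }
        return threads * 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        int threads = 200; // default pool size mentioned in the text
        for (long latencyMillis : new long[] {100, 1000, 2000}) {
            System.out.printf("%d ms -> %.0f requests/s%n",
                    latencyMillis, maxRequestsPerSecond(threads, latencyMillis));
        }
    }
}
```

For 100 ms, 1 second and 2 seconds this gives 2000, 200 and 100 requests per second, matching the figures above. Any thread busy rendering pages or delivering assets lowers the bound further.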

The most common issue I have seen with this pattern is that in case of backend performance problems the threadpools of all AEM instances are consumed within seconds, leading almost immediately to an outage of the AEM service. That means that a problem on the backend or on the connection between AEM and the backend takes down your page rendering abilities, leaving you with what is cached at the CDN level.

The common recommendation we make in these cases is quite obvious: introduce more aggressive timeouts. But the actual solution to this problem is a different one:

Do not use AEM as a proxy.

This is a perfect example of a case where the client (browser) itself can do the integration. Instead of proxying (= tunneling) all backend traffic through AEM, the client could approach the backend service directly, because then the constraints AEM has (for example the number of concurrent requests) no longer apply to the calls to the backend. Instead the backend is exposed directly to the end users, and uses whatever technology is suitable for that; typically it is exposed via an API gateway.

If the backend gets slow, AEM is not affected. If AEM has issues, the backend is not directly impacted because of it. AEM does not even need to know that there is a backend at all. Both systems are entirely decoupled.

As you see, I very much prefer this approach of “integration at the frontend layer” and exposing the backend to the end users over any type of “AEM calls the backend systems”, mostly because such architectures are less complex and easier to debug and analyze. And that should be your default and preferred approach whenever such an integration is required.

Disclaimer: Yes, there are cases where the application logic requires AEM to do backend calls; but in these cases it’s questionable if such requests need to be done synchronously, meaning that an AEM request needs to do a backend call and wait for its result. If these requests can be done asynchronously, then the whole problem vector I outlined above simply does not exist.

Note: In my opinion hiding the hostnames of your backend system is also not a good reason for such a backend integration. And “the service is just available from within our company network and AEM accesses it via VPN” is not a good reason either. In both cases you can achieve the same with a publicly accessible API gateway, which is specifically designed to handle such use cases and all security-relevant implications of them.

So, do not use AEM as a simple proxy!

My view on manual cache flushing

I read the following statement by Samuel Fawaz on LinkedIn regarding the recent announcement of the self-service feature to get the API key for CDN purge for AEM as a Cloud Service:

[…] Sometimes the CDN cache is just messed up and you want to clean out everything. Now you can.

I fully agree, that a self-service for this feature was overdue. But I always wonder why an explicit cache flush (both for CDN and dispatcher) is necessary at all.

The caching rules are very simple, as the rules for the AEM as a Cloud Service CDN are all based on the TTL (time-to-live) information sent from AEM or the dispatcher configuration. The caching rules for the dispatcher are equally simple and should be well understood (I find that this blog post on the TechRevel blog covers this topic of dispatcher cache flushing quite well).

In my opinion it should be doable to build a model which allows you to predict how long it takes for a page update to be visible to all users on the CDN. Such a model also allows you to reason about more complex situations (especially when content is pulled from multiple pages/areas to render) and to understand how and when content changes become visible for end users.

But when I look at the customer requests coming in for cache flushes (CDN and dispatcher), I think that in most cases there is no clear understanding of what actually happened; most often it’s just that on author the content is as expected and activated properly, but this change does not show up the same way on publish. The solution is often to request a cache flush (or trigger it yourself) and hope for the best. And very often this fixes the problem, and then the most up-to-date content is delivered.

But is there an understanding of why the caches were not updated properly? Honestly, I doubt that very often. In the same way as the infamous “Windows restart” can fix annoying, suddenly appearing problems with your computer, flushing caches seems to be one of the first steps for fixing content problems. The issue goes away, we shrug and go on with our work.

But unlike in the case of Windows the situation is different here, because you have the dispatcher configuration in your git repository. And you know the rules of caching. You have everything you need to understand the problem better and even prevent it from happening again.

Whenever the authoring users come to you with the request “content is not showing up, please flush the cache”, you should consider this situation a bug. Because it is a bug: the system is not working as expected. You should apply the workaround (do the flush), but afterwards invest time into a root-cause analysis (RCA) of why it happened. Understand and adjust the caching rules, because very often these cases are well reproducible.

In his LinkedIn post Samuel writes “Sometimes the CDN cache is just messed up“, and I think that is not true. It’s not a random event you cannot influence at all. On the contrary, it’s an event which is defined by your caching configuration. It’s an event which you can control and prevent; you just need to understand how. And I think that this step of understanding and then fixing it is missing very often. And then the next request from your authoring users for a cache flush is inevitable, and another cache flush is executed.

In the end flushing caches comes with the price of increased latency for end users until the cache is populated again. And that’s a situation we should avoid as best we can.

So as a conclusion:

An explicitly requested cache clear is a bug because it means that something is not working as expected.
And as every bug it should be understood and fixed, so you are no longer required to perform the workaround.

If you have curl, every problem looks like a request

If you are working in IT (or as a craftsperson) you should know the saying: “When you have a hammer, every problem looks like a nail”. It describes the tendency of people who have a tool that reliably solves a specific problem to use this tool for every other problem as well, even if it does not fit at all.

Sometimes I see this pattern in AEM as well, but not with a hammer, but with “curl”. Curl is a commandline HTTP client, and it’s quite easy to fire a request against AEM and do something with the output of it. It’s something every AEM developer should be familiar with, also because it’s a great tool to automate things. And if you talk about “automating AEM”, the first thing people often come up with is “curl”…

And there the problem starts: Not every problem can be automated with curl. For example take a periodic data export from AEM. The immediate reaction of most developers (forgive me if I generalize here, but I have seen this pattern too often!) is to write a servlet to pull all this data together, create a CSV list and then use curl to request this servlet every day/week.

Works great, doesn’t it? Good, mark that task as done, next!

Wait a second, on prod it takes 2 minutes to create that list. Well, not a problem, right? Until it takes 20 minutes, because the number of assets is growing. And until you move to AEM CS, where the timeout of requests is 60 seconds, and your curl is terminated with a statuscode 503.

So what is the problem? It is not the timeout of 60 seconds; and it’s also not the constantly increasing number of assets. It’s the fact that this is a batch operation, and you use a communication pattern (request/response) which is not well suited for batch operations. It’s the fact that you start with curl in mind (a tool which is built for the request/response pattern) and therefore you build the implementation around this pattern. You have curl, so every problem is solved with a request.

What are the limits of this request/response pattern? Definitely the runtime is a limit, and actually for 3 reasons:

The timeout for requests on AEM CS (or basically any other loadbalancer) is set for security reasons and to prevent misuse. Of course the limit of 60 seconds in AEM CS is a bit arbitrary, but personally I would not wait 60 seconds for a webpage to start rendering. So it’s as good as any higher number.
There is another limit, which is determined by the availability of the backend system that is actually processing this request. In a highly available and autoscaling environment systems start and stop in an automated fashion, managed by a control plane which operates on a set of rules. And these rules can enforce that any (AEM) system is forced to shut down at most 10 minutes after it has stopped receiving new requests. For a request which consistently takes 30+ minutes, that means it might be terminated without finishing successfully. And it’s unclear if your curl would even notice (especially if you are streaming the results).
(And technically you can also add that the whole network connection needs to be kept open for that long, and AEM CS itself is just a single factor in there. Also the internet is not always stable; you can experience network hiccups at any point in time. That is normally just well hidden by retrying failing requests, which is not an option here, because it won’t solve the problem at all.)

In short: If your task can take long (say: 60+ seconds), then a request is not necessarily the best option to implement it.

So, what options do you have then? Well, the following approach works also in AEM CS:

Use a request to create and initiate your task (let’s call it a “job”);
And then poll the system until this job is completed, then return the result.

This is an asynchronous pattern, and it’s much more scalable when it comes to the amount of processing you can do in there.
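The client side of this pattern can be sketched in a few lines of Java. This is a minimal sketch under assumptions, not a reference implementation: the `JobPoller` class, the `Status` enum and the `statusCheck` supplier are hypothetical; in a real client the supplier would wrap an HTTP call to whatever status endpoint your own job implementation exposes.

```java
import java.time.Duration;
import java.util.function.Supplier;

final class JobPoller {

    enum Status { RUNNING, SUCCEEDED, FAILED }

    /**
     * Polls statusCheck until the job leaves RUNNING or the attempts are used up.
     * Returns the last status seen; RUNNING means "gave up waiting".
     */
    static Status waitForJob(Supplier<Status> statusCheck, int maxAttempts, Duration pause) {
        Status status = Status.RUNNING;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            status = statusCheck.get();
            if (status != Status.RUNNING) {
                return status;
            }
            try {
                Thread.sleep(pause.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and stop polling
                return status;
            }
        }
        return status;
    }

    public static void main(String[] args) {
        int[] polls = {0};
        // Simulated job: reports RUNNING twice, then SUCCEEDED.
        Status result = waitForJob(
                () -> ++polls[0] < 3 ? Status.RUNNING : Status.SUCCEEDED,
                10, Duration.ofMillis(50));
        System.out.println(result + " after " + polls[0] + " polls");
    }
}
```

Each poll is a short request of its own, so none of them comes anywhere near the 60 second limit, no matter how long the job itself runs.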

Of course you cannot use a single curl command anymore; instead you need to write a program to execute this logic (don’t write it in a shell script please!). But on the AEM side you can now use either Sling jobs or AEM workflows to perform the operation.

This avoids the restriction of 60 seconds and it can handle restarts of AEM transparently, at least on the author side. And you have the huge benefit that you can collect all your errors during the runtime of this job and decide afterwards if the execution was a success or a failure (which you cannot do with a single HTTP response once you have started streaming it).

So when you have long-running operations, check if you need to do them within a request. In many cases it’s not required, and then please switch gears to an asynchronous pattern. And that’s something you can do even before the situation starts to become a problem.

Sling Model Exporter: What is exported into the JSON?

Last week we came across a strange phenomenon, when the AEM release validation process broke in an unexpected situation. Which is indeed a good thing, because it covered an aspect I had never thought of.

The validation broke because during a request the serialization of a Sling Model failed with an exception. The short version: It tried to serialize a ResourceResolver(!) into JSON (more details in SLING-11924). Why would anyone serialize a ResourceResolver into a JSON to be consumed by an SPA? I firmly believe that this was not done intentionally, but happened by accident. Nevertheless, it broke the improvement we intended to make, so we had to roll it back and wait for SLING-11924 to be implemented.

But it gives me the opportunity to explain, which fields of a Sling Model are exported by the SlingModelExporter. As it is backed by the Jackson data-bind framework, the same rules apply:

All public fields are serialized
All publicly available getter methods which do not expect a parameter are serialized.

It is not too hard to check this, but there are a few subtle aspects to consider in the context of Sling Models.

Injections: make sure that you make only those injections public which you want to be handled by the SlingModelExporter. Make everything else private.
I often see Lombok used to create getters for Sling Models (because you need them for the use in HTL). This is especially problematic when the annotation @Getter is placed on class level, because then a getter is created for every field (no matter its visibility), which is then picked up by the SlingModelExporter.
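To get a feeling for what the two rules above pick up, here is a small reflection-based sketch in plain Java. It is only an approximation of those two rules and not how Jackson itself works internally (Jackson also honors annotations, `is...` getters for booleans and more); `ExportCheck` and `LeakyModel` are hypothetical names, and the `Object` field is just a stand-in for a real ResourceResolver.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;

final class ExportCheck {

    /** Property names which the two rules (public fields, public no-arg getters) would pick up. */
    static Set<String> exportedNames(Class<?> type) {
        Set<String> names = new TreeSet<>();
        for (Field field : type.getFields()) { // getFields() returns public fields only
            if (!Modifier.isStatic(field.getModifiers())) {
                names.add(field.getName());
            }
        }
        for (Method method : type.getMethods()) { // getMethods() returns public methods only
            String name = method.getName();
            boolean isGetter = name.startsWith("get") && name.length() > 3
                    && method.getParameterCount() == 0
                    && method.getReturnType() != void.class
                    && !Modifier.isStatic(method.getModifiers())
                    && method.getDeclaringClass() != Object.class; // skips getClass()
            if (isGetter) {
                names.add(Character.toLowerCase(name.charAt(3)) + name.substring(4));
            }
        }
        return names;
    }

    /** Hypothetical model: the getter for the resolver is public by accident. */
    static class LeakyModel {
        public String title = "Hello";
        private Object resourceResolver = new Object(); // stand-in for a ResourceResolver
        public Object getResourceResolver() { return resourceResolver; }
        private String getInternalId() { return "x"; }
    }

    public static void main(String[] args) {
        System.out.println(exportedNames(LeakyModel.class));
    }
}
```

For `LeakyModel` the result contains `title` and `resourceResolver`, but not `internalId`: the private field does not protect you once a public getter exists for it, which is exactly what a class-level @Getter generates.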

My call to action: Validate your Sling Models and check that you don’t export a ResourceResolver by accident. (If you are an AEM as a Cloud Service customer and affected by this problem, you will probably get an email from us, telling you to do exactly that.)

Sling models performance (part 3)

In the first and second part of this series “Sling Models performance” I covered aspects which can degrade the performance of your Sling Models, be it by not specifying the correct injector or by re-using complex models (with expensive @PostConstruct methods) for very simple cases.

And there is another aspect when it comes to performance degradation, and it starts with a very cool convenience function. Because Sling Models can create a whole tree of objects. Imagine this code as part of a Sling Model:

@ChildResource
AnotherModel child;

It will adapt the child-resource named “child” into the class “AnotherModel” and inject it. This nesting is a cool feature and can be a time-saver if you have a more complex resource structure to model your content.

But it also comes with a price, because it will create another Sling Model object; and even that Sling Model can trigger the creation of more Sling Models, and so on. And as I have outlined in my previous posts, the creation of these Sling Models does not come for free. So if your “main Sling Model” internally creates a whole tree of Sling Models, the required time will increase. That can be justified, but not if you just need a fraction of the data of the Sling Models. So is it worth spending 10 milliseconds to create a complex Sling Model just to call a simple getter of it, if you could retrieve this information alone in just 10 microseconds?

So this is a situation, where I need to repeat what I have written already in part 2:

When you build your Sling Models, try to resolve all data lazily, when it is requested the first time.
Sling Model Performance (part 2)

But unfortunately, injectors do not work lazily but eagerly; injections are executed as part of construction of the model. Having a lazy injection would be a cool feature …

So until this is available, you should check the re-use of Sling Models quite carefully; always consider how much work is actually done in the background, and if the value of reusing that Sling Model is worth the time spent in rendering.

The most expensive HTTP request

TL;DR: When you do a performance test for your application, also test a situation where you just fire a large number of invalid requests, because you need to know if your error handling is good enough to withstand this often unplanned load.

In my opinion the most expensive HTTP requests are the ones which return a 404, because they don’t bring any value, are not as easily cacheable as others and are very easy to generate. If you look into AEM logs, you will often find requests from random parties which fire a lot of requests, obviously trying to find vulnerable software. In AEM these always fail with a status code 404, because there are no resources with these names. But this turns into a problem if these 404 pages are complex to render, taking 1 second or more. In that case requesting 1000 non-existing URLs can turn into a denial of service.

This can get even more complex if you work with suffixes, and the end user just requests the suffix, because the actual resource is prepended by mod_rewrite on the dispatcher. In such situations the requested resource is present (the page you configured), but the suffix can be invalid (for example pointing to a non-existing resource). Depending on the implementation you can find out very late about this situation; and then you have already rendered a major part of the page just to find out that the suffix is invalid. This can also lead to a denial of service, but is much harder to mitigate than the plain 404 case.

So what’s the best way to handle such situations? You should test for such a situation explicitly. Build a simple performance test which just fires a few hundred requests triggering a 404, and observe the response time of the regular requests. It should not degrade! If you need to simplify your 404 pages, then do that! Many popular websites have very stripped-down 404 pages for just that reason.

And when you design your URLs you should always have in mind these robots, which just show up with (more or less) random strings.

Sling Models performance, part 2

In the last blog post I demonstrated the impact of the correct type of annotations on performance of Sling Models. But there is another aspect of Sling Models, which should not be underestimated. And that’s the impact of the method which is annotated with @PostConstruct.

If you are not interested in the details, just skip to the conclusion at the bottom of this article.

To illustrate this aspect, let me give you an example. Assume that you have a navigation (or list component) in which you want to display only pages of the type “product page” which are specifically marked to be displayed. Because you are a developer who favors clean code, you already have a “ProductPageModel” Sling Model which also offers a “showInNav()” method. So your code will look like this:

List<Page> pagesToDisplay = new ArrayList<>();
for (Page child : page.listChildren()) {
  ProductPageModel ppm = child.adaptTo(ProductPageModel.class);
  if (ppm != null && ppm.showInNav()) {
    pagesToDisplay.add(child);
  }
}

This works perfectly fine; but I have seen this approach to be the root cause of severe performance problems. Mostly because the ProductPageModel is designed as the one and only Sling Model backing a product page; the @PostConstruct method of the ProductPageModel contains all the logic to retrieve and calculate all required information, for example product information, datalayer information, etc.

But in this case only a simple property is required, all other properties are not used at all. That means that the majority of the operations in the @PostConstruct method are pure overhead in this situation and consuming time. It would not be necessary to execute them at all in this case.

Many Sling Models are designed for a single purpose, for example rendering a page, where such a Sling Model is used extensively by an HTL script. But there are cases where the very same Sling Model class is used for different purposes, where only a subset of this information is required. Even in this case the whole set of properties is resolved, as you would need it for the rendering of the complete page.

I prepared a small test-case on my github account to illustrate the performance impact of such code on the performance of the adaption:

ModelWithPostConstruct contains a method annotated with @PostConstruct, which resolves another property via an InheritanceValueMap.
ModelWithoutPostConstruct provides the same semantics, but executes the calculations lazily, only when the information is required.

The benchmark is implemented in a simple servlet (SlingModelPostConstructServlet), which you can invoke on the path “/bin/slingmodelpostconstruct”

$ curl -u admin:admin http://localhost:4502/bin/slingmodelpostconstruct
test data created below /content/cqdump/performance
de.joerghoh.cqdump.performance.core.models.ModelWithPostconstruct: single adaption took 50 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithoutPostconstruct: single adaption took 11 microseconds

The overhead is quite obvious, almost 40 microseconds per adaption; of course it’s dependent on the amount of logic within this @PostConstruct method. And this postconstruct method is quite small, compared to other SlingModels I have seen. And in the cases where only a minimal subset of the information is required, this is pure overhead. Of course the overhead is often minimal if you just consider a single adaption, but given the large number of Sling Models in typical AEM projects, the chance is quite high that this turns into a problem sooner or later.

So you should pay attention to the different situations in which you use your Sling Models. Especially if you have such vastly different cases (rendering the full page vs just getting one property) you should invest a bit of time and optimize them for these use cases. Which leads me to the following:

Conclusion

When you build your Sling Models, try to resolve all data lazily, when it is requested the first time. Keep the @PostConstruct method as small as possible.
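What “lazily” means here can be shown with a plain Java sketch, deliberately without any Sling dependency. It is not the code from the benchmark; `LazyProductModel` and the `expensiveLookup` supplier are hypothetical stand-ins for a model and the work that would otherwise sit in its @PostConstruct method.

```java
import java.util.function.Supplier;

final class LazyProductModel {

    private final boolean showInNav;
    private final Supplier<String> expensiveLookup; // stand-in for repository access

    private String productInfo;          // resolved on first access only
    private boolean productInfoResolved;

    LazyProductModel(boolean showInNav, Supplier<String> expensiveLookup) {
        // The constructor only stores cheap values; no lookup happens here.
        this.showInNav = showInNav;
        this.expensiveLookup = expensiveLookup;
    }

    boolean showInNav() {
        return showInNav;
    }

    String getProductInfo() {
        if (!productInfoResolved) {
            productInfo = expensiveLookup.get(); // pay the price once, and only if asked
            productInfoResolved = true;
        }
        return productInfo;
    }
}
```

A navigation component which only calls `showInNav()` never triggers the expensive lookup, while the page rendering still gets the full data from `getProductInfo()`. The sketch is not thread-safe, which is an assumption that only holds for models which are not shared between threads.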

Sling Model Performance

In my daily job as an SRE for AEM as a Cloud Service I often have to deal with performance questions, especially in the context of migrations of customer applications. Applications sometimes perform differently on AEM CS than they did on AEM 6.x, and a part of my job is to look into these cases.

This often leads to interesting deep dives and learnings; you might have seen this reflected in the postings of this blog 🙂 The problem this time was a tight loop like this:

for (Resource child : resource.getChildren()) {
  SlingModel model = child.adaptTo(SlingModel.class);
  if (model != null && model.hasSomeCondition()) {
    // some very lightweight work
  }
}

This code performed well with 1000 child resources on an AEM 6.x authoring instance, but quite poorly on an AEM CS authoring instance with the same number of child nodes. And the problem is not the large number of child nodes …

After wading knee-deep through TRACE logs I found the problem at an unexpected location. But before I present the solution and some recommendations, let me explain some background. Of course you can skip the next section and jump directly to the TL;DR at the bottom of this article.

SlingModels and parameter injection

One of the beauties of Sling Models is that these are simple POJOs, and properties are injected by the Sling Models framework. You just have to add matching annotations to mark them accordingly. See the full story in the official documentation.

The simple example in the documentation looks like this:

@Inject String title;

which (typically) injects the property named “title” from the resource this model was adapted from. In the same way you can inject services, child nodes and many other useful things.

To make this work, the framework uses an ordered list of Injectors, which are able to retrieve values to be injected (see the list of available injectors). The first injector which returns a non-null value is taken and its result is injected. In this example the ValueMapInjector is supposed to return a property called “title” from the valueMap of the resource, which is quite early in the list of injectors.

Ok, now let’s understand what the system does here:

@Inject @Optional String doesNotExist;

Here an optional field is declared, and if there is no property called “doesNotExist” in the valueMap of the resource, the other injectors are queried if they can handle that injection. Assuming that no injector can do that, the value of the field “doesNotExist” remains null. No problem at first sight.

But indeed there is a problem, and it’s performance. Even the lookup of a non-existing property (or node) in the JCR takes time, and doing this a few hundred or even thousand times in a loop can slow down your code. And with a slower repository (like the clustered MongoDB persistence in the AEM as a Cloud Service authoring instances) even more so.
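The effect can be simulated in plain Java. This is a simplified model of the “first non-null value wins” behaviour described above and not the Sling Models implementation; `InjectorChainDemo`, the four stand-in injectors and their order are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class InjectorChainDemo {

    /** Asks each injector in order; the first non-null value wins. Counts every lookup. */
    static Object resolve(String name, List<Function<String, Object>> injectors, AtomicInteger lookups) {
        for (Function<String, Object> injector : injectors) {
            lookups.incrementAndGet();
            Object value = injector.apply(name);
            if (value != null) {
                return value;
            }
        }
        return null; // an optional injection simply stays null
    }

    public static void main(String[] args) {
        Map<String, Object> valueMap = Map.of("title", "Hello");
        List<Function<String, Object>> injectors = List.of(
                name -> null,   // an injector which cannot provide the value
                valueMap::get,  // stand-in for the ValueMap injector
                name -> null,   // stand-ins for the injectors further down the list
                name -> null);

        AtomicInteger hit = new AtomicInteger();
        resolve("title", injectors, hit);
        AtomicInteger miss = new AtomicInteger();
        resolve("doesNotExist", injectors, miss);
        System.out.println("title: " + hit.get() + " lookups, doesNotExist: " + miss.get() + " lookups");
    }
}
```

The existing property stops the chain after 2 lookups, while the missing one walks through all 4 stand-in injectors. In a real system some of these lookups go to the repository, which is where the time is lost.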

To demonstrate it, I wrote a small benchmark (source code on my GitHub account), which performs a lot of adaptions to Sling Models. When deployed to AEM 6.5.5 or later (or a recent version of the AEM CS SDK) you can run it via curl -u admin:admin http://localhost:4502/bin/slingmodelcompare

This is its output:

de.joerghoh.cqdump.performance.core.models.ModelWith3Injects: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith3ValueMaps: single adaption took 16 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalValueMap: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalValueMaps: single adaption took 20 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalInject: single adaption took 83 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalInjects: single adaption took 137 microseconds

The benchmark takes a very simple list of resources and adapts them to a number of model classes which differ in the type of annotations they use. Adapting to a model which injects 3 properties takes approximately 20 microseconds, but as soon as a model has a failing injection (declared with “@Optional” to avoid failing the adaption), the duration increases massively to 83 microseconds, and even to 137 microseconds when there are 2 of these failed injections.

Ok, so having a few of these failed injections is not a problem per se (you could do 2’000 within 100 milliseconds), but this test setup is a bit artificial, which makes these 2’000 a really optimistic number:

It is running on a system with a fast repository (SDK on my M1 MacBook), so for example the ChildResourceInjector has almost no overhead when testing for the presence of a child resource called “doesNotExist”. This can be different elsewhere: on AEM CS Author the Mongo storage has a higher latency than the segment store on the SDK or on a publish instance. If that (non-existing) child resource is not in the cache, there is an additional latency in the range of 1 ms to load that information. What for? Well, basically for nothing.
The OsgiInjector is queried as well, which tries to access the OSGi ServiceRegistry; this registry is a central piece of OSGi, and its consistency is heavily guarded by locks. I have seen this injector being blocked by these locks, which also adds latency.

That means that these 50-60 microseconds could easily multiply, and then performance becomes a problem. And this is the problem which initially sparked this investigation.
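A rough back-of-the-envelope calculation shows how quickly this adds up. The sketch below uses the numbers from this article (about 20 microseconds for a plain adaption, 55 microseconds as the midpoint of the 50-60 microsecond penalty, roughly 1 ms for an uncached repository lookup). The class and method names are hypothetical, and the assumption that every failed injection pays the full extra latency is a deliberate worst case, not a measurement.

```java
// Hypothetical estimator; all inputs are assumptions taken from the numbers in the text.
public class FailedInjectionCost {

    /** Estimated total time in microseconds for a number of adaptions. */
    static long estimateMicros(int adaptions, long baseMicros, int failedInjections,
                               long penaltyMicros, long extraLatencyMicros) {
        // each failed injection adds the measured penalty plus any repository latency
        long perAdaption = baseMicros
                + (long) failedInjections * (penaltyMicros + extraLatencyMicros);
        return adaptions * perAdaption;
    }

    public static void main(String[] args) {
        // 1000 child resources, 2 failed injections per model, fast local repository
        long local = estimateMicros(1000, 20, 2, 55, 0);
        // same loop, but every failed lookup also pays ~1 ms (worst-case assumption)
        long remote = estimateMicros(1000, 20, 2, 55, 1000);
        System.out.println("local:  " + local / 1000 + " ms");   // prints: local:  130 ms
        System.out.println("remote: " + remote / 1000 + " ms");  // prints: remote: 2130 ms
    }
}
```

Under these assumptions a loop over 1000 child resources goes from about 130 ms to more than 2 seconds, without any change to the code.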

So what can we do to avoid this situation? That is quite easy: Do not use @Inject, but use the specialized injectors directly (see them in the documentation). While the benefit is probably quite small when it comes to properties which are present (ModelWith3Injects took 18 microseconds vs 16 microseconds for ModelWith3ValueMaps), the difference becomes dramatic as soon as we consider failed injections:

1 failed injection: 18 microseconds (ModelWithOptionalValueMap) vs 83 microseconds (ModelWithOptionalInject)
2 failed injections: 20 microseconds (ModelWith2OptionalValueMaps) vs 137 microseconds (ModelWith2OptionalInjects)
And with every additional failed injection for that model the penalty will increase by another 50-60 microseconds.
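Why a specific injector avoids the penalty can be illustrated by counting how many injectors are consulted for a missing property. This is a plain Java sketch and not Sling code: the class name is hypothetical, and the injector names and their order are assumptions loosely based on the list in the documentation.

```java
import java.util.List;
import java.util.Map;

// Hypothetical comparison of lookups; not part of Sling.
public class FailedInjectionSketch {

    // Illustrative injector names; the real list and order are in the documentation.
    static final List<String> CHAIN = List.of(
            "script-bindings", "valuemap", "child-resources",
            "request-attributes", "osgi-services");

    /** Generic lookup: walks the chain until the value map delivers a value. */
    static int lookupsWithGenericInject(Map<String, Object> properties, String name) {
        int asked = 0;
        for (String injector : CHAIN) {
            asked++;
            // in this sketch only the value map can ever return a value
            if (injector.equals("valuemap") && properties.get(name) != null) {
                break;
            }
        }
        return asked;
    }

    /** Specific lookup: only the value map is asked, hit or miss. */
    static int lookupsWithSpecificInjector(Map<String, Object> properties, String name) {
        properties.get(name);   // the single lookup; the result does not change the count
        return 1;
    }

    public static void main(String[] args) {
        Map<String, Object> properties = Map.of("title", "Hello");
        int adaptions = 1000;
        long generic = (long) adaptions * lookupsWithGenericInject(properties, "doesNotExist");
        long specific = (long) adaptions * lookupsWithSpecificInjector(properties, "doesNotExist");
        System.out.println("generic:  " + generic + " lookups");   // prints: generic:  5000 lookups
        System.out.println("specific: " + specific + " lookups");  // prints: specific: 1000 lookups
    }
}
```

With the generic lookup a miss is the most expensive case, because every injector in the chain has to answer before the field is left null.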

Even in my local benchmark the improvement can be seen quite easily: there is almost no overhead for such a failed injection if I explicitly mark it as an injection via the ValueMapInjector. And as mentioned, this overhead can be even larger in reality.

Still, this is a micro-optimization in the majority of cases; but as mentioned already, many of these optimizations taken together can definitely make a difference.

TL;DR Use injector-specific annotations

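Instead of @Inject, use the correct injector-specific annotation directly (for example @ValueMapValue for properties, @ChildResource for child resources or @OSGiService for services). You normally know exactly where you want that injected value to come from.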
And by the way: did you know that the use of @Inject is discouraged in favor of these injector-specific annotations?

Update: The Sling Models documentation has been updated and explicitly discourages the use of @Inject now.