Sling models performance (part 3)

In the first and second part of this series “Sling Models performance” I covered aspects which can degrade the performance of your Sling models, be it by not specifying the correct injector or by re-using complex models for very simple cases (by complex PostConstruct models).

And there is another aspect when it comes to performance degradation, and it starts with a very cool convenience function. Because Sling Models can create a whole tree of objects. Imagine this code as part of a Sling Model:

@ChildResource
AnotherModel child;

It will adapt the child-resource named “child” into the class “AnotherModel” and inject it. This nesting is a cool feature and can be a time-saver if you have a more complex resource structure to model your content.

But also it comes with a price, because it will create another Sling Model object; and even that Sling Model can trigger the creation of more Sling Models, and so on. And as I have outlined in my previous posts, the creation of these Sling Models does not come for free. So if your “main Sling Model” internally creates a whole tree of Sling Models, the required time will increase. Which can be justified, but not if you just need a fraction of the data of the Sling Models. So is it worth to spend 10 miliseconds to create a complex Sling Model just to call a simple getter of it, if you could retrieve this information alone in just 10 microseconds?

So this is a situation, where I need to repeat what I have written already in part 2:


When you build your Sling Models, try to resolve all data lazily, when it is requested the first time.

Sling Model Perforamance (part 2)

But unfortunately, injectors do not work lazily but eagerly; injections are executed as part of construction of the model. Having a lazy injection would be a cool feature …

So until this is available, you should use check the re-use of Sling Model quite carefully; always consider how much work is actually done in the background, and if the value of reusing that Sling Model is worth the time spent in rendering.

The most expensive HTTP request

TL;DR: When you do a performance test for your application, also test a situation where you just fire large number of invalid requests; because you need to know if your error-handling is good enough to withstand this often unplanned load.

In my opinion the most expensive HTTP requests are the ones which return with a 404. Because they don’t bring any value, are not as easily cacheable as others and are very easily to generate. If you are looking into AEM logs, you will often find requests from random parties which fire a lot of requests, obviously trying to find vulnerable software. But in AEM these always fail, because there are not resources with these names, returning a statuscode 404. But this turns a problem if these 404 pages are complex to render, taking 1 second or more. In that case requesting 1000 non-existing URLs can turn into a denial of service.

This can even get more complex, if you work with suffixes, and the end user can just request the suffix, because you prepend that actual resource by mod_rewrite on the dispatcher. In such situations the requested resource is present (the page you configured), but the suffix can be invalid (for example point to a non-existing resource). Depending on the implementation you can find out very late about this situation; and then you have already rendered a major part of the page just to find out that the suffix is invalid. This can also lead to a denial of service, but is much harder to mitigate than the plain 404 case.

So what’s the best way to handle such situations? You should test for such a situation explicitly. Build a simple performance test which just fires a few hundreds requests triggering a 404, and observe the response time of the regular requests. It should not drop! If you need to simplify your 404 pages, then do that! Many popular websites have very stripped down 404 pages for just that reason.

And when you design your URLs you should always have in mind these robots, which just show up with (more or less) random strings.

AEM article review December 2022

I am doing this blog now for quite some time (the first article in this blog dates back to December 2008! That was the time of CQ 5.0! OMG), and of course I am not the only one writing on AEM. Actually the number of articles which are produced every months is quite large, but I am often a bit disappointed because many just reproduce some very basic aspects of AEM, which can be found at many places. But the amount of new content which describe aspects which have barely been covered by other blog posts or the official product documentation is small.

For myself I try to focus on such topics, offer unique views on the product and provide recommendations how things can be done (better), all based on my personal experiences. I think that this type of content is appreciated by the community, and I get good feedback on it. To encourage the broader community to come up with more content covering new aspects I will do a little experiment and promote a few selected articles of others. I think that these article show new aspects or offer a unique view on certain on AEM.

Depending on the feedback I will decide i will continue with this experiment. If you think that your content also offers new views, uncovers hidden features or suggests best practices, please let me know (see the my contact data here). I will judge these proposals on the above mentioned criteria. But of course it will be still my personal decision.

Let’s start with Theo Pendle, who has written an article on how to write your own custom injector for Sling Models. The example he uses is a real good one, and he walks you through all the steps and explains very well, why that is all necessary. I like the general approach of Theos writing and consider the case of safely injecting cookie values as a valid for such a injector. But in general I think that there are not many other cases out there, where it makes sense to write custom injectors.

Also on a technical level John Mitchell has his article “Using Sling Feature Flags to Manage Continous Releases“, published on the Adobe Tech Blog. He introduces Sling Features and how you can use them to implement Feature Flags. And that’s something I have not seen used yet in the wild, and also the documentation is quite sparse on it. But he gives a good starting point, although a more practical example would be great 🙂

The third article I like the most. Kevin Nenning writes on “CRXDE Lite, the plague of AEM“. He outlines why CRXDE Lite has gained such a bad reputation within Adobe, that disabling CRXDE Lite is part of the golive checklist for quite some time. But on the other hand he loves the tool because it’s a great way for quick hacks on your local development instance and for a general read-only tool. This is an article every AEM developer should read.
And in case you haven’t seen it yet: AEM as a Cloud Service offers the repository browser in the developer console for a read-only view on your repo!

And finally there is Yuri Simione (an Adobe AEM champion), who published 2 articles discussing the question “Is AEM a valid Content Services Plattform?” (article 1, article 2). He discusses an implementation which is based on Jackrabbit/Oak and Sling (but not AEM) to replace an aging Documentum system. And finally he offers an interesting perspective on the future of Jackrabbit. Definitely a read if you are interested in a more broader use of AEM and its foundational pieces.

That’s it for December. I hope you enjoy these articles as much as I did, and that you can learn from them and get some new inspiration and insights.

Sling Models performance, part 2

In the last blog post I demonstrated the impact of the correct type of annotations on performance of Sling Models. But there is another aspect of Sling Models, which should not be underestimated. And that’s the impact of the method which is annotated with @PostConstruct.

If you are not interested in the details, just skip to the conclusion at the bottom of this article.

To illustrate this aspect, let me give you an example. Assume that you have a navigation (or list component) in which you want to display only pages of the type “product pages” which are specifically marked to be displayed. Because you are developer which is favoring clean code, you already have a “ProductPageModel” Sling Model which also offers a “showInNav()” method. So your code will look like this:

List<Page> pagesToDisplay = new ArrayList<>();
for (Page child : page.listChildren()) {
  ProductPageModel ppm = child.adaptTo(ProductPageModel.class);
  if (ppm != null && ppm.showInNav()) {
    pagesToDisplay.add(child);
  }
}

This works perfectly fine; but I have seen this approach to be the root cause for severe performance problems. Mostly because the ProductPageModel is designed the one and only Sling Model backing a Product Page; the @PostConstruct method of the ProductPageModel contains all the logic to calculate all retrieve and calculate all required information, for example Product Information, datalayer information, etc.

But in this case only a simple property is required, all other properties are not used at all. That means that the majority of the operations in the @PostConstruct method are pure overhead in this situation and consuming time. It would not be necessary to execute them at all in this case.

Many Sling Models are designed for a single purpose, for example rendering a page, where such a sling model is used extensively by an HTL scriptlet. But there are cases where the very same SlingModel class is used for different purposes, when only a subset of this information is required. But also in this case the whole set of properties is resolved, as it you would need for the rendering of the complete page.

I prepared a small test-case on my github account to illustrate the performance impact of such code on the performance of the adaption:

  • ModelWithPostConstruct contains a method annotated with @PostConstruct, which resolves a another property via an InheritanceValueMap.
  • ModelWithoutPostConstruct provides the same semantic, but executes the calculations lazy, only when the information is required.

The benchmark is implement in a simple servlet (SlingModelPostConstructServlet), which you can invoke on the path “/bin/slingmodelpostconstruct”

$ curl -u admin:admin http://localhost:4502/bin/slingmodelpostconstruct
test data created below /content/cqdump/performance
de.joerghoh.cqdump.performance.core.models.ModelWithPostconstruct: single adaption took 50 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithoutPostconstruct: single adaption took 11 microseconds

The overhead is quite obvious, almost 40 microseconds per adaption; of course it’s dependent on the amount of logic within this @PostConstruct method. And this postconstruct method is quite small, compared to other SlingModels I have seen. And in the cases where only a minimal subset of the information is required, this is pure overhead. Of course the overhead is often minimal if you just consider a single adaption, but given the large number of Sling Models in typical AEM projects, the chance is quite high that this turns into a problem sooner or later.

So you should pay attention on the different situations when you use your Sling Models. Especially if you have such vastly different cases (rendering the full page vs just getting one property) you should invest a bit of time and optimize them for these usecases. Which leads me to the following:

Conclusion

When you build your Sling Models, try to resolve all data lazily, when it is requested the first time. Keep the @PostConstruct method as small as possible.

Sling Model Performance

In my daily job as an SRE for AEM as a Cloud Service I often have to deal with performance questions, especially in the context of migrations of customer applications. Applications sometimes perform differently on AEM CS than they did on AEM 6.x, and a part of my job is to look into these cases.

This often leads to interesting deep dives and learnings; you might have seen this reflected in the postings of this blog 🙂 The problem this time was a tight loop like this:

for (Resource child: resource.getChildren()) {
SlingModel model = child.adaptTo(SlingModel.class);
if (model != null && model.hasSomeCondition()) {
// some very lightweight work
}
}

This code performed well with 1000 child resources in a AEM 6.x authoring instance, but quite poorly on an AEM CS authoring instance with the same number of child nodes. And the problem is not the large number of childnodes …

After wading knee-deep through TRACE logs I found the problem at an unexpected location. But before I present you the solution and some recommendations, let me you explain some background. But of course you can skip the next section and jump directly to the TL;DR at the bottom of this article.

SlingModels and parameter injection

One of the beauties of Sling Models is that these are simple PoJos, and properties are injected by the Sling Models framework. You just have to add matching annotations to mark them accordingly. See the full story in the official documentation.

The simple example in the documentation looks like this:

@Inject
String title;

which (typically) injects the property named “title” from the resource this model was adapted from. The same way you can inject services, child-nodes any many other useful things.

To make this work, the framework uses an ordered list of Injectors, which are able to retrieve values to be injected (see the list of available injectors). The first injector which returns a non-null value is taken and its result is injected. In this example the ValueMapInjector is supposed to return a property called “title” from the valueMap of the resource, which is quite early in the list of injectors.

Ok, now let’s understand what the system does here:

@Inject
@Optional
String doesNotExist;

Here a optional field is declared, and if there is no property called “doesNotExist” in the valueMap of the resource, other injectors are queried if they can handle that injection. Assuming that no injector can do that, the value of the field “doesNotExist” remains null. No problem at first sight.

But indeed there is a problem, and it’s perfomance. To demonstrate it, I wrote a small benchmark (source code on my github account), which does a lot of adaptions to Sling Models. When deployed to AEM 6.5.5 or later (or a recent version of the AEM CS SDK) you can run it via curl -u admin:admin http://localhost:4502/bin/slingmodelcompare

This is its output:

de.joerghoh.cqdump.performance.core.models.ModelWith3Injects: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith3ValueMaps: single adaption took 16 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalValueMap: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalValueMaps: single adaption took 20 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalInject: single adaption took 83 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalInjects: single adaption took 137 microsecond
s

It’s a benchmark which on a very simple list of resources tries adaptions to a number of Model classes, which are different in their type of annotations. So adapting to a model which injects 3 properties takes approximately 20 microseconds, but as soon as a model has a failing injection (which is declared with “@Optional” to avoid failing the adaption), the duration increases massively to 83 microseconds, and even 137 microseconds when 2 these failed injections are there.

Ok, so having a few of such failed injections do not make a problem per se (you could do 2’000 within 100 milliseconds), but this test setup is a bit artificial, which makes these 2’000 a really optimistic number:

  • It is running on a system with a fast repository (SDK on my M1 Macbook); so for example the ChildResourceInjector does not has almost no overhead to test for the presence of a childResource called “doesNotExist”. This can be different, for example on AEM CS Author the Mongo storage has a higher latency than the segmentStore on the SDK or a publish. If that (non-existing) child-resource is not in the cache, there is an additional latency in the range of 1ms to load that information. What for? Well, basically for nothing.
  • The OsgiInjector is queried as well, which tries to access the OSGI ServiceRegistry; this registry is a central piece of OSGI, and it’s consistency is heavily guarded by locks. I have seen this injector being blocked by these locks, which also adds latency.

That means that these 50-60 microseconds could easily multiply, and then the performance is getting a problem. And this is the problem which initially sparked this investigation.

So what can we do to avoid this situation? That is quite easy: Do not use @Inject, but use the specialized injectors directly (see them in the documentation). While the benefit is probably quite small when it comes to properties which are present (ModelWith3Injects tool 18 microseconds vs 16 microseconds of ModelWith3ValueMaps), the different gets dramatic as soon as we consider failed injections:

Even in my local benchmark the improvement can be seen quite easily, there is almost no overhead of such a failed injection, if I explicitly mark them as Injection via the ValueMapInjector. And as mentioned, this overhead can be even larger in reality.

Still, this is a micro-optimization in the majority of all cases; but as mentioned already, many of these optimizations implemented definitely can make a difference.

TL;DR Use injector-specific annotations

Instead of @Inject use directly the correct injector. You normally know exactly where you want that injected value to come from.
And by the way: did you know that the use of @Inject is discouraged in favor of these injector-specific annotations?

(Note to myself: The Sling Models documentation needs an update, especially the examples.)

Limits of dispatcher caching with AEM as a Cloud Service

In the last blog post I proposed 5 rules for Caching with AEM, how you should design your caching strategy. Today I want to show another aspect of rule 1: Prefer caching at the CDN over caching at the dispatcher.

I already explained that the CDN is always located closer to the consumer, so the latency is lower and the experience will be better. But when we limit the scope to AEM as a Cloud Service, the situation gets a bit complicated, because the dispatcher is not able to cache files for more than 24 hours.

This is caused by a few architectural decisions done for AEM as a Cloud Service:

These 2 decisions lead to the fact, that no dispatcher cache can hold files fore more than 24 hours because the instance is terminated after that time. And there are other situations where the publishs are to be re-created, for example during deployments and up/down-scaling situations, and then the cache does not contain files for 24 hours, but maybe just 3 hours.

This naturally can limit the cache-hit ratio in cases where you have content which is requested frequently but is not changed in days/weeks or even months. In an AEM as a Cloud Service setup these files are then rendered once per day (or more often, see above) per publish/dispatcher, while in other setups (for example AMS on on-prem setups where long-living dispatcher caches are pretty much default) it can delivered from the dispatcher cache without the need to re-render it every day.

The CDN does not have this limitation. It can hold for days and weeks and deliver them, if the TTL settings allow this. But as you can control the CDN only via TTL, you have to make a tradeoff between cache-hit ratio on the CDN and the accuracy of the delivered content regarding a potential change.

That means:

  • If you have files which do not change you just set a large TTL to them and then let the CDN handle them. A good example are clientlibs (JS and CSS files), because they have a unique name (an additional selector which is created as a hash over the content of the file.
  • If there’s a chance that you make changes to such content (mostly pages), you should set a reasonable TTL (and of course “stale-while-revalidate”) and accept that your publishs need to re-render these pages when the time has passed.

That’s a bit a drawback of the AEM as a Cloud Service setup, but on the hand side your dispatcher caches are regularly cleared.

Dispatcher, CDN and Caching

In today’s web performance discussions, there is a lot of focus on the browser as the most important. Google defines Web Core Vitals, and there are many other aspects which are important to have a fast site. Plus then SEO …

While many developers focus on these, I see that many sites often neglect the importance of proper caching. While many of these sites already use a CDN (in AEM CS a CDN is part of any offering), they often do not use the CDN in an optimal way; this can result in slow pages (because of the network latency) and also unnecessary load on the backend systems.

In this blog post I want to outline some ways how you can optimize your site for caching, with a focus on AEM in combination with a CDN. It does not really matter if it is AEM as a Cloud Service or AEM on AMS or on-premises, these recommendations can be applied to all of them.

Rule 1: Prefer caching at the CDN over caching at the dispatcher

The dispatcher is located close the AEM instance and typically co-located to your AEM instances. There is a high latency from the dispatcher to the end-user, especially if your end-users are spread across the globe. For example the average latency between Frankfurt/Germany and Sydney/Australia is approximately 250ms, and that makes browsing a website not really fast. Using a decent CDN can cut reduce these numbers dramatically.

Also a CDN is better suited to handle millions of requests per minute than a bunch of VMs running dispatcher instances, both from a cost perspective and from a perspective of knowhow required to operate at that scale.

That means that your caching strategy should aim for an optimal caching at the CDN level. The dispatcher is fine as a secondary cache to handle cache-misses or expired cache items. But no direct enduser request should it ever make through to the dispatcher.

Rule 2: Use TTL-based invalidation

The big advantage of the dispatcher is the direct control of the caching. You deliver your content from the cache, until you change that content. And immediately after the change the cache is actively invalidated, and your changed content is delivered. But you cannot use the same approach for CDNs, and while the CDNs made reasonable improvements to reduce the time to actively invalidate content from the CDNs, it still takes minutes.

A better approach is to use a TTL-based (time-to-live) invalidation (or rather: expiration), where every CDN node can decide on its own if a file in the cache is still valid or not. And if the content is too old, it’s getting refetched from the origin (your dispatchers).

Although this approach introduces some latency from the time of content activation to the time all users world-wide are to see it, such a latency is acceptable in general.

Rule 3: Staleness is not (necessarily) a problem

When you optimize your site, you need not only optimize that every request is requested from the CDN (instead from your dispatchers); but you also should think about what happens if a requested file is expired on the CDN. Ideally it should not matter much.

Imagine that you have a file which is configured with a TTL of 300 seconds. What should happen if this file is requested 301 seconds after it has been stored in the CDN cache. Should the CDN still deliver it (and accept that the user receives a file which can be a bit older than specified) or do you want to the user to wait until the CDN has obtained a fresh copy of that file?
Typically you accept that staleness for a moment and deliver the old copy for a while, until the CDN has obtained a fresh copy in the background. Use the “stale-while revalidate” caching headers to configure this behavior.

Rule 4: Pay attention to the 404s

A HTTP status 404 (“File not found”) is tricky to handle, because by default a 404 is not cached at the CDN. That means that all those requests will hit your dispatcher and eventually even your AEM instances, which are the authoritative source to answer if such a file exists. But the number of requests a AEM instance can handle is much smaller than the number the dispatchers or even the CDN can handle. And you should reserve these precious resources on doing something more useful than responding with “sorry, the resource you requested is not here”.

For that reason check the 404s and handle them appropriately; you have a number of options for that:

  • Fix incorrect links which are under your control.
  • Create dispatcher rules or CDN settings which handle request patterns which you don’t control, and return a 404 from there.
  • You also have the option to allow the CDN to cache a 404 response.

In any way, you should manage the 404s, because they are most expensive type of requests: You spend resources to deliver “nothing”.

Rule 5: Know your query strings

Query strings for requests were used a lot of provide parameters to the server-side rendering process, and you might use that approach as well in your AEM application. But query strings are also used a lot to tag campaign traffic for correct attribution; you might have seen such requests already, they often contain parameters like “utm_source”, “fbclid” etc. But these parameters do not have impact on the server-side rendering!
Because these requests cannot be cached by default, CDN and dispatcher will forward all requests containing any query string to AEM. And that’s again the most scarce resource, and having it rendered there will again impose the latency hit on your site visitors.

The dispatcher has the ability of remove named query strings from the request, which enables it to serve such requests from the dispatcher cache; that’s not as good as serving these requests from the CDN but much much better than handling them on AEM. You should use that as much as possible.

If you follow these rules, you have the chance not to only improve the user experience for your visitors, but at the same time you make your site much more scalable and resilient against attacks and outages.

What’s the maximum size of a node in JCR/AEM?

An interesting question which comes up every now and then is: “Is there a limit how large a JCR node can get?”.  And as always in IT, the answer is not that simple.

In this post I will answer that question and also outline why this limit is hardly a constraint in AEM development. Also I will show ways how you can design your application so that this limit is not a problem at all.

Continue reading “What’s the maximum size of a node in JCR/AEM?”

Sling Scheduled Jobs vs Sling Scheduler

Apache Sling and AEM provide 2 different approaches to start processes at a given time or in a given interval. It is not always trivial to make the right decision between these two, and I have seen a few cases of misuse already. Let’s dive into this topic and I will outline in what situation to use the Scheduler and when to use Scheduled Jobs.

Continue reading “Sling Scheduled Jobs vs Sling Scheduler”

How to analyze “Authentication support missing”

Errors and problems in running software manifest often in very interesting and non-obvious cases. A problem in location A manifests itself only with an unrelated error message in a different location B.

We also have one example of such a situation in AEM, and that’s the famous “Authentication support missing” error message.  I see often the question “I got this error message; what should I do now?”, and so I decided: It’s time to write a blog post about it. Here you are.

“Authentication support missing” is actually not even correct: There is no authentication module available, so you cannot authenticate. But in 99,99% of the cases this is just a symptom. Because the default AEM authentication depends on a running SlingRepository service. And a running Sling repository has a number of dependencies itself.

I want to highlight 2 of these dependencies, because they tend to cause problems most often: The Oak repository and the RepositoryInitializer service. Both must be up and be started/run succesfully until the SlingRepository service is being registered succesfully. Let’s look into each of these dependencies.

The Oak repository

The Oak repository is a quite complex system in itself, and there are many reasons why it did not start. To name a few:

  • Consistency problems with the repository files on disk (for whatever reasons), permission problems on the filesystem, full disks, …
  • Connectivity issues towards the storage (especially if you use a database or mongodb as storage)
  • Messed up configuration

If you have an “authentication support missing” message, you first check should be on the Oak repository, typically reachable in the AEM error.log. If you have an ERROR messages logged by any “org.apache.jackrabbit.oak” class during the startup, this is most likely the culprit. Investigate from there.

Sling Repository Initializer (a.k.a. “repoinit”)

Repoinit is designed to ensure that a certain structure in the repository is provided, even before any consumer is accessing it. All of the available scripts must be executed, and any failure will immediate terminate the startup of the SlingRepositoryService. Check also my latest blog post on Sling Repository Initializer for details how to prevent such problems.

Repoinit failures are typically quite prominent in the AEM error.log, just search for an ERROR message starting with this:

*ERROR* [Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted …

These are 2 biggest contributors to this “Authentication support missing” error messages. Of course there are more reasons why it could appear. But to be honest, I only have seen these 2 cases in the last years.

I hope that this article helps you to investigate such situations more swiftly.