My view on manual cache flushing

I read the following statement by Samuel Fawaz on LinkedIn regarding the recent announcement of the self-service feature to get the API key for CDN purge for AEM as a Cloud Service:

[…] “Sometimes the CDN cache is just messed up and you want to clean out everything. Now you can.”

I fully agree that a self-service for this feature was overdue. But I always wonder why an explicit cache flush (both for the CDN and the dispatcher) is necessary at all.

The caching rules are very simple: the rules for the AEM as a Cloud Service CDN are all based on the TTL (time-to-live) information sent by AEM or set in the dispatcher configuration. The caching rules for the dispatcher are equally simple and should be well understood (I find that this post on the TechRevel blog covers the topic of dispatcher cache flushing quite well).

In my opinion it should be possible to build a model which lets you predict how long it takes for a page update to become visible to all users via the CDN. Such a model also lets you reason about more complex situations (especially when content is pulled from multiple pages/areas during rendering) and understand how and when content changes become visible to end users.
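The basis for such a model are the standard Cache-Control headers, which the AEM as a Cloud Service CDN honors and which you can set in your dispatcher vhost configuration. A minimal sketch (the path pattern and TTL values here are just an illustration, not a recommendation):

```
<LocationMatch "^/content/.*\.html$">
  # cache pages for 5 minutes in the browser and 10 minutes on the CDN,
  # but only for successful responses
  Header set Cache-Control "max-age=300, s-maxage=600" "expr=%{REQUEST_STATUS} == 200"
</LocationMatch>
```

With explicit rules like this you can reason precisely: a page update is visible to everyone at the latest after the s-maxage (plus the browser's max-age) has elapsed.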

But when I look at the customer requests coming in for cache flushes (CDN and dispatcher), I think that in most cases there is no clear understanding of what actually happened. Most often the content on the author instance is as expected and was activated properly, but the change does not show up the same way on publish. The solution is often to request a cache flush (or trigger it yourself) and hope for the best. And very often this fixes the problem, and the most up-to-date content is delivered again.

But is there an understanding of why the caches were not updated properly? Honestly, I doubt it in many cases. Just as the infamous “Windows restart” can fix annoying, suddenly appearing problems with your computer, flushing caches seems to be one of the first steps for fixing content problems. The issue goes away, we shrug and go on with our work.

But unlike the Windows case, the situation is different here, because you have the dispatcher configuration in your git repository. And you know the rules of caching. You have everything you need to understand the problem better and even prevent it from happening again.

Whenever the authoring users come to you with the request “content is not showing up, please flush the cache”, you should consider this situation a bug, because the system is not working as expected. You should apply the workaround (do the flush), but afterwards invest time into a root-cause analysis (RCA) of why it happened. Understand and adjust the caching rules, because very often these cases are well reproducible.

In his LinkedIn post Samuel writes “Sometimes the CDN cache is just messed up“, and I think that is not true. It's not a random event you cannot influence at all. On the contrary: it's an event which is defined by your caching configuration. It's an event which you can control and prevent, you just need to understand how. And I think that this step of understanding and then fixing is missing very often. Then the next request from your authoring users for a cache flush is inevitable, and another cache flush is executed.

In the end, flushing caches comes with the price of increased latency for end users until the cache is populated again. And that's a situation we should avoid as much as we can.

So as a conclusion:

  • An explicitly requested cache clear is a bug because it means that something is not working as expected.
  • And like every bug, it should be understood and fixed, so that the workaround is no longer required.

If you have curl, every problem looks like a request

If you are working in IT (or as a craftsperson) you should know the saying: “When you have a hammer, every problem looks like a nail”. It describes the tendency of people who have a tool which reliably solves one specific problem to apply this tool to every other problem as well, even if it does not fit at all.

Sometimes I see this pattern in AEM as well, but not with a hammer, but with curl. Curl is a command-line HTTP client, and it's quite easy to fire a request against AEM and do something with its output. It's a tool every AEM developer should be familiar with, also because it's great for automating things. And when you talk about “automating AEM”, the first thing people often come up with is curl…

And there the problem starts: not every problem can be automated with curl. Take, for example, a periodic data export from AEM. The immediate reaction of most developers (forgive me if I generalize here, but I have seen this pattern too often!) is to write a servlet which pulls all this data together and creates a CSV list, and then to use curl to request this servlet every day or week.

Works great, doesn't it? Good, mark that task as done, next!

Wait a second: on production it takes 2 minutes to create that list. Well, not a problem, right? Until it takes 20 minutes, because the number of assets keeps growing. And until you move to AEM CS, where requests time out after 60 seconds, and your curl call is terminated with a status code 503.

So what is the problem? It is not the timeout of 60 seconds, and it's also not the constantly increasing number of assets. It's the fact that this is a batch operation, and you are using a communication pattern (request/response) which is not well suited for batch operations. It's the fact that you start with curl in mind (a tool built for the request/response pattern) and therefore build the implementation around this pattern. You have curl, so every problem is solved with a request.

What are the limits of this request/response pattern? Definitely the runtime is a limit, actually for 3 reasons:

  • The timeout for requests on AEM CS (or basically any other load balancer) is set for security reasons and to prevent misuse. Of course the limit of 60 seconds in AEM CS is a bit arbitrary, but personally I would not wait 60 seconds for a web page to start rendering, so it's as good as any higher number.
  • There is another limit, determined by the availability of the backend system which actually processes the request. In a highly available and autoscaling environment, systems start and stop in an automated fashion, managed by a control plane which operates on a set of rules. These rules can enforce that any (AEM) system is shut down at most 10 minutes after it has stopped receiving new requests. That means a request which consistently takes 30+ minutes might be terminated without ever finishing successfully. And it's unclear whether your curl call would even notice (especially if you are streaming the results).
  • (And technically the whole network connection needs to be kept open for that long, and AEM CS itself is just a single factor in there. The internet is not always stable either; you can experience network hiccups at any point in time. This is normally well hidden by retrying failed requests, which is not an option here, because it won't solve the problem at all.)

In short: If your task can take long (say: 60+ seconds), then a request is not necessarily the best option to implement it.

So, what options do you have then? Well, the following approach works also in AEM CS:

  1. Use a request to create and initiate your task (let’s call it a “job”);
  2. Then poll the system until this job is completed, and fetch the result.

This is an asynchronous pattern, and it’s much more scalable when it comes to the amount of processing you can do in there.
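This fire-and-poll idea can be reduced to plain Java, independent of any AEM API; here is a minimal sketch (the job-id handling and the "export" are obviously simplified):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// The "create a job, then poll" pattern: the first call returns immediately
// with a job id, and the caller polls for completion instead of keeping
// one long-running request open.
public class Main {
    static final ExecutorService executor = Executors.newSingleThreadExecutor();
    static final Map<String, Future<String>> jobs = new ConcurrentHashMap<>();

    // corresponds to the initial request: start the work, return a job id
    static String startJob() {
        String id = UUID.randomUUID().toString();
        jobs.put(id, executor.submit(() -> {
            Thread.sleep(100); // stand-in for the long-running export
            return "export-result.csv";
        }));
        return id;
    }

    // corresponds to the polling request: is the job finished yet?
    static boolean isDone(String id) {
        return jobs.get(id).isDone();
    }

    public static void main(String[] args) throws Exception {
        String id = startJob();
        while (!isDone(id)) {
            Thread.sleep(20); // the client polls instead of blocking in one request
        }
        System.out.println("job result: " + jobs.get(id).get());
        executor.shutdown();
    }
}
```

In AEM the executor would be replaced by a Sling Job or a workflow, and the two methods by two servlets (or one servlet with two selectors), but the shape of the interaction stays the same.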

Of course you cannot use a single curl command anymore; you need to write a program to execute this logic (don't write it in a shell script, please!). But on the AEM side you can now use either Sling Jobs or AEM Workflows to perform the operation.

This approach avoids the 60-second restriction and can handle restarts of AEM transparently, at least on the author side. And you have the huge benefit that you can collect all errors during the runtime of the job and decide afterwards whether the execution succeeded or failed (which you cannot do in plain HTTP).

So when you have long-running operations, check whether you really need to execute them within a single request. In many cases that's not required, and then please switch gears to an asynchronous pattern. That's something you can do even before the situation starts to become a problem.

Sling Model Exporter: What is exported into the JSON?

Last week we came across a strange phenomenon, when the AEM release validation process broke in an unexpected way. Which is actually a good thing, because it covered an aspect I had never thought of.

The validation broke because during a request the serialization of a Sling Model failed with an exception. The short version: it tried to serialize a ResourceResolver(!) into JSON (more details in SLING-11924). Why would anyone serialize a ResourceResolver into JSON to be consumed by an SPA? I firmly believe that this was not done intentionally, but happened by accident. Nevertheless, it broke the improvement we intended to make, so we had to roll it back and wait for SLING-11924 to be implemented.

But it gives me the opportunity to explain which fields of a Sling Model are exported by the SlingModelExporter. As it is backed by the Jackson data-binding framework, the same rules apply:

  • All public fields are serialized.
  • All public getter methods which do not expect a parameter are serialized.
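You can observe the second rule with the JDK's own bean introspection, which discovers no-arg getters the same way (note that public fields, which Jackson additionally exports, are not bean properties and therefore don't show up here; the class below is a made-up example):

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

// A class with a mix of members, mirroring a typical Sling Model
class DemoModel {
    public String title = "a title";   // public field: Jackson exports it directly
    private String secret = "hidden";  // private, no getter: not exported
    private String path = "/content";  // private, but exposed via a getter

    public String getPath() { return path; }         // no-arg public getter: exported
    public String lookup(String key) { return key; } // takes a parameter: ignored
}

public class Main {
    public static void main(String[] args) throws IntrospectionException {
        // list all no-arg getters, the way bean-style property discovery sees them
        for (PropertyDescriptor pd :
                Introspector.getBeanInfo(DemoModel.class, Object.class).getPropertyDescriptors()) {
            if (pd.getReadMethod() != null) {
                System.out.println("exported property: " + pd.getName());
            }
        }
    }
}
```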

It is not too hard to check this, but there are a few subtle aspects to consider in the context of Sling Models.

  • Injections: make only those injections public which you want to be handled by the SlingModelExporter. Make everything else private.
  • I often see Lombok used to create getters for Sling Models (because you need them for use in HTL). This is especially problematic when the @Getter annotation is placed at class level, because then a getter is created for every field (no matter its visibility), which is then picked up by the SlingModelExporter.
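To illustrate the Lombok case, here is a hypothetical model (all class and field names are made up): the class-level @Getter silently generates a getter for the injected ResourceResolver as well, which the exporter then tries to serialize.

```java
@Getter // Lombok generates a public getter for EVERY field below
@Model(adaptables = Resource.class)
public class TeaserModel {

    @ValueMapValue
    private String title;              // getTitle() is fine to export

    @SlingObject
    private ResourceResolver resolver; // getResolver() is generated too, so the
                                       // exporter now serializes the ResourceResolver
}
```

Placing @Getter on the individual fields you actually want exported avoids this trap.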

My call to action: validate your Sling Models and check that you don't export a ResourceResolver by accident. (If you are an AEM as a Cloud Service customer and affected by this problem, you will probably get an email from us telling you to do exactly that.)

Sling models performance (part 3)

In the first and second part of this series “Sling Models performance” I covered aspects which can degrade the performance of your Sling Models, be it by not specifying the correct injector or by re-using complex models for very simple cases (models with complex @PostConstruct methods).

And there is another aspect to performance degradation, and it starts with a very cool convenience feature: Sling Models can create a whole tree of objects. Imagine this code as part of a Sling Model:

@ChildResource
AnotherModel child;

It will adapt the child resource named “child” to the class “AnotherModel” and inject it. This nesting is a cool feature and can be a time-saver if you have a more complex resource structure modelling your content.

But it also comes with a price, because it will create another Sling Model object; and even that Sling Model can trigger the creation of more Sling Models, and so on. And as I have outlined in my previous posts, the creation of these Sling Models does not come for free. So if your “main Sling Model” internally creates a whole tree of Sling Models, the required time will increase. Which can be justified, but not if you just need a fraction of the data of those Sling Models. Is it worth spending 10 milliseconds to create a complex Sling Model just to call a single simple getter, if you could retrieve that information alone in just 10 microseconds?

So this is a situation where I need to repeat what I have already written in part 2:


When you build your Sling Models, try to resolve all data lazily, when it is requested the first time.

Sling Models performance, part 2

But unfortunately, injectors do not work lazily but eagerly; injections are executed as part of the construction of the model. Having a lazy injection would be a cool feature …

So until this is available, you should check the re-use of Sling Models quite carefully; always consider how much work is actually done in the background, and whether the value of reusing that Sling Model is worth the time spent in rendering.

The most expensive HTTP request

TL;DR: When you do a performance test for your application, also test a situation where you just fire a large number of invalid requests, because you need to know whether your error handling is good enough to withstand this often unplanned load.

In my opinion the most expensive HTTP requests are the ones which return a 404. They don't bring any value, they are not as easily cacheable as others, and they are very easy to generate. If you look into AEM logs, you will often find random parties firing a lot of requests, obviously trying to find vulnerable software. In AEM these always fail, because there are no resources with these names, and a status code 404 is returned. But this becomes a problem if these 404 pages are complex to render, taking 1 second or more. In that case, requesting 1000 non-existing URLs can turn into a denial of service.

This can get even more complex if you work with suffixes and the end user can request just the suffix, because you prepend the actual resource via mod_rewrite on the dispatcher. In such situations the requested resource is present (the page you configured), but the suffix can be invalid (for example, it points to a non-existing resource). Depending on the implementation you may find out about this very late, when you have already rendered a major part of the page, just to discover that the suffix is invalid. This can also lead to a denial of service, but it is much harder to mitigate than the plain 404 case.

So what’s the best way to handle such situations? You should test for such a situation explicitly. Build a simple performance test which just fires a few hundreds requests triggering a 404, and observe the response time of the regular requests. It should not drop! If you need to simplify your 404 pages, then do that! Many popular websites have very stripped down 404 pages for just that reason.

And when you design your URLs, you should always keep these robots in mind, which just show up with (more or less) random strings.

Sling Models performance, part 2

In the last blog post I demonstrated the impact of choosing the correct annotations on the performance of Sling Models. But there is another aspect of Sling Models which should not be underestimated: the impact of the method annotated with @PostConstruct.

If you are not interested in the details, just skip to the conclusion at the bottom of this article.

To illustrate this aspect, let me give you an example. Assume that you have a navigation (or list) component in which you want to display only pages of the type “product page” which are specifically marked to be displayed. Because you are a developer who favors clean code, you already have a “ProductPageModel” Sling Model which offers a “showInNav()” method. So your code will look like this:

List<Page> pagesToDisplay = new ArrayList<>();
Iterator<Page> children = page.listChildren();
while (children.hasNext()) {
  Page child = children.next();
  ProductPageModel ppm = child.adaptTo(ProductPageModel.class);
  if (ppm != null && ppm.showInNav()) {
    pagesToDisplay.add(child);
  }
}

This works perfectly fine, but I have seen this approach as the root cause of severe performance problems. Mostly because the ProductPageModel is designed as the one and only Sling Model backing a product page; the @PostConstruct method of the ProductPageModel contains all the logic to retrieve and calculate all required information, for example product information, datalayer information, etc.

But in this case only a single simple property is required; all other properties are not used at all. That means that the majority of the operations in the @PostConstruct method are pure overhead in this situation and just consume time. It would not be necessary to execute them at all.

Many Sling Models are designed for a single purpose, for example rendering a page, where such a Sling Model is used extensively by an HTL script. But there are cases where the very same Sling Model class is used for different purposes which require only a subset of this information. Even then, the whole set of properties is resolved, as if it were needed for rendering the complete page.

I prepared a small test case on my GitHub account to illustrate the impact of such code on the performance of the adaption:

  • ModelWithPostConstruct contains a method annotated with @PostConstruct, which resolves another property via an InheritanceValueMap.
  • ModelWithoutPostConstruct provides the same semantics, but executes the calculations lazily, only when the information is requested.

The benchmark is implemented in a simple servlet (SlingModelPostConstructServlet), which you can invoke at the path “/bin/slingmodelpostconstruct”:

$ curl -u admin:admin http://localhost:4502/bin/slingmodelpostconstruct
test data created below /content/cqdump/performance
de.joerghoh.cqdump.performance.core.models.ModelWithPostconstruct: single adaption took 50 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithoutPostconstruct: single adaption took 11 microseconds

The overhead is quite obvious, almost 40 microseconds per adaption; of course it depends on the amount of logic within the @PostConstruct method, and this one is quite small compared to other Sling Models I have seen. In the cases where only a minimal subset of the information is required, this is pure overhead. The overhead of a single adaption is often negligible, but given the large number of Sling Models in typical AEM projects, the chance is quite high that this turns into a problem sooner or later.

So you should pay attention to the different situations in which your Sling Models are used. Especially if you have such vastly different cases (rendering the full page vs. just reading one property), you should invest a bit of time and optimize them for these use cases. Which leads me to the following:

Conclusion

When you build your Sling Models, try to resolve all data lazily, when it is requested the first time. Keep the @PostConstruct method as small as possible.
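A minimal sketch of what this lazy style can look like (the class and property names are made up for illustration):

```java
@Model(adaptables = Resource.class)
public class ProductPageModel {

    @SlingObject
    private Resource resource;

    private String navTitle; // resolved on first access, not in @PostConstruct

    public String getNavTitle() {
        if (navTitle == null) {
            navTitle = resource.getValueMap().get("navTitle", String.class);
        }
        return navTitle;
    }
}
```

The full-page rendering still triggers all getters eventually, but a caller which needs only getNavTitle() no longer pays for everything else.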

Sling Model Performance

In my daily job as an SRE for AEM as a Cloud Service I often have to deal with performance questions, especially in the context of migrations of customer applications. Applications sometimes perform differently on AEM CS than they did on AEM 6.x, and a part of my job is to look into these cases.

This often leads to interesting deep dives and learnings; you might have seen this reflected in the postings of this blog 🙂 The problem this time was a tight loop like this:

for (Resource child : resource.getChildren()) {
  SlingModel model = child.adaptTo(SlingModel.class);
  if (model != null && model.hasSomeCondition()) {
    // some very lightweight work
  }
}

This code performed well with 1000 child resources on an AEM 6.x authoring instance, but quite poorly on an AEM CS authoring instance with the same number of child nodes. And the problem is not the large number of child nodes …

After wading knee-deep through TRACE logs I found the problem in an unexpected location. But before I present the solution and some recommendations, let me explain some background. Of course you can also skip the next section and jump directly to the TL;DR at the bottom of this article.

SlingModels and parameter injection

One of the beauties of Sling Models is that they are simple POJOs, and properties are injected by the Sling Models framework. You just have to add the matching annotations to mark them accordingly. See the full story in the official documentation.

The simple example in the documentation looks like this:

@Inject
String title;

which (typically) injects the property named “title” from the resource this model was adapted from. In the same way you can inject services, child nodes and many other useful things.

To make this work, the framework uses an ordered list of injectors, which are able to retrieve the values to be injected (see the list of available injectors). The first injector which returns a non-null value wins, and its result is injected. In this example the ValueMapInjector, which comes quite early in the list of injectors, is supposed to return a property called “title” from the ValueMap of the resource.

Ok, now let’s understand what the system does here:

@Inject
@Optional
String doesNotExist;

Here an optional field is declared, and if there is no property called “doesNotExist” in the ValueMap of the resource, the other injectors are queried whether they can handle this injection. Assuming that no injector can, the value of the field “doesNotExist” remains null. No problem at first sight.

But indeed there is a problem, and it's performance. Even the lookup of a non-existing property (or node) in the JCR takes time, and doing this a few hundred or even a thousand times in a loop can slow down your code; a slower repository (like the clustered MongoDB persistence of the AEM as a Cloud Service authoring instances) slows it down even more.

To demonstrate it, I wrote a small benchmark (source code on my GitHub account) which performs a lot of adaptions to Sling Models. When deployed to AEM 6.5.5 or later (or a recent version of the AEM CS SDK), you can run it via curl -u admin:admin http://localhost:4502/bin/slingmodelcompare

This is its output:

de.joerghoh.cqdump.performance.core.models.ModelWith3Injects: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith3ValueMaps: single adaption took 16 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalValueMap: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalValueMaps: single adaption took 20 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalInject: single adaption took 83 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalInjects: single adaption took 137 microseconds

It’s a benchmark which on a very simple list of resources tries adaptions to a number of Model classes, which are different in their type of annotations. So adapting to a model which injects 3 properties takes approximately 20 microseconds, but as soon as a model has a failing injection (which is declared with “@Optional” to avoid failing the adaption), the duration increases massively to 83 microseconds, and even 137 microseconds when 2 these failed injections are there.

Ok, so a few of such failed injections are not a problem per se (you could do 2'000 within 100 milliseconds), but this test setup is a bit artificial, which makes these 2'000 a really optimistic number:

  • It is running on a system with a fast repository (the SDK on my M1 MacBook); so, for example, the ChildResourceInjector has almost no overhead when testing for the presence of a child resource called “doesNotExist”. This can be different, for example on AEM CS authoring, where the MongoDB storage has a higher latency than the SegmentStore of the SDK or a publish instance. If that (non-existing) child resource is not in the cache, loading that information adds a latency in the range of 1 ms. What for? Well, basically for nothing.
  • The OsgiInjector is queried as well, which tries to access the OSGi service registry; this registry is a central piece of OSGi, and its consistency is heavily guarded by locks. I have seen this injector being blocked by these locks, which adds latency as well.

That means that these 50-60 microseconds can easily multiply, and then the performance becomes a problem. And this is the problem which initially sparked this investigation.

So what can we do to avoid this situation? That is quite easy: do not use @Inject, but use the injector-specific annotations directly (see them in the documentation). While the benefit is probably quite small for properties which are present (ModelWith3Injects took 18 microseconds vs. 16 microseconds for ModelWith3ValueMaps), the difference gets dramatic as soon as we consider failed injections:

Even in my local benchmark the improvement can be seen quite easily: there is almost no overhead for such a failed injection if I explicitly bind it to the ValueMapInjector. And as mentioned, this overhead can be even larger in reality.

Still, this is a micro-optimization in the majority of cases; but as mentioned already, many of these optimizations combined can definitely make a difference.

TL;DR Use injector-specific annotations

Instead of @Inject, use the annotation of the correct injector directly. You normally know exactly where you want the injected value to come from.
And by the way: did you know that the use of @Inject is discouraged in favor of these injector-specific annotations?
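As a sketch, a model using injector-specific annotations could look like this (the class and property names are illustrative):

```java
@Model(adaptables = Resource.class)
public class MyModel {

    @ValueMapValue // only the ValueMapInjector is consulted
    private String title;

    @ValueMapValue(name = "jcr:description", injectionStrategy = InjectionStrategy.OPTIONAL)
    private String description;

    @ChildResource // only looks for the child resource "image"
    private Resource image;
}
```

If an optional injection fails here, no other injectors are queried, so the failure costs almost nothing.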

Update: The Sling Models documentation has been updated and explicitly discourages the use of @Inject now.

Sling Scheduled Jobs vs Sling Scheduler

Apache Sling and AEM provide 2 different approaches to start processes at a given time or at a given interval. It is not always trivial to make the right decision between the two, and I have seen a few cases of misuse already. Let's dive into this topic; I will outline in which situations to use the Scheduler and when to use Scheduled Jobs.


The deprecation of Sling Resource Events

Sling events are used for many aspects of the system, and initially JCR changes were delivered through them as well. But OSGi eventing (on top of which Sling events are built) is not designed for huge volumes of events (thousands per second), and that is a situation which can happen in AEM. One of the most compelling reasons to move away from this approach is that all these event handlers (both for resource change events and all others) share a single thread pool.

For that reason ResourceChangeListeners have been introduced. Here each listener declares in detail which changes it is interested in (restricted by path and type of change); therefore Sling is able to optimize the listening on the JCR level and does not listen for changes no one is interested in. This can reduce the load on the system and improve performance.
For this reason the usage of OSGi resource event listeners is deprecated (although they still work as expected).

How can I find all the ResourceChangeEventListeners in my codebase?

That’s easy, because on startup for each of these ResourceChangeEventListeners you will find a WARN message in the logs like this:

Found OSGi Event Handler for deprecated resource bridge: com.acme.myservice

This will help you to identify all these listeners.

How do I rewrite them to ResourceChangeListeners?

In the majority of cases this should be straightforward. Make your service implement the ResourceChangeListener interface and provide these additional OSGi properties:

@Component(
    service = ResourceChangeListener.class,
    configurationPolicy = ConfigurationPolicy.IGNORE,
    property = {
        ResourceChangeListener.PATHS + "=/content/dam/asset-imports",
        ResourceChangeListener.CHANGES + "=ADDED",
        ResourceChangeListener.CHANGES + "=CHANGED",
        ResourceChangeListener.CHANGES + "=REMOVED"
    })
public class MyAssetImportListener implements ResourceChangeListener {

    @Override
    public void onChange(List<ResourceChange> changes) {
        // react to the reported changes
    }
}

With this switch you allow resource events to be processed separately in an optimized way; they no longer block other OSGi events.

How to handle errors in Sling Servlet requests

Error handling is a topic developers rarely pay much attention to. It is done when the API forces them to handle an exception. And the most common pattern I see is “log and throw”, which means that the exception is logged and then re-thrown.

When you develop in the context of HTTP requests, error handling can get tricky, because you need to signal to the consumer of the response that an error happened and the request was not successful. Frameworks are designed to handle any exception internally and set the correct status code if necessary, and Sling is no different: if your code throws an exception (for example in the @PostConstruct method of a Sling Model), the Sling framework catches it and sets the status code 500 (Internal Server Error).

I’ve seen code which catches exceptions itself and sets the status code on the response itself. But this is rarely the right approach, because with every exception handled this way the developer implicitly states: “These are my exceptions and I know best how to handle them”; almost as if the developer takes ownership of these exceptions and their root causes, and as if there were nothing which could handle this situation better.

Handling exceptions completely on your own like this is not best practice, and I see 2 problems with it:

  • Setting the status code alone is not enough; the remaining part of the request processing needs to be stopped as well. Otherwise the processing continues as if nothing had happened, which is normally not useful or even allowed. It’s hard to ensure this when the exception is caught.
  • Owning the exception handling removes the responsibility from others. In AEM as a Cloud Service, Adobe monitors response codes and the exceptions causing them. If there’s only a status code 500 but no exception reaching the SlingMainServlet, it is likely ignored, because the developer claimed ownership of the exception (handling).

If you write a Sling servlet or other code operating in the context of a request, it is best practice not to catch exceptions, but to let them bubble up to the Sling Main Servlet, which is able to handle them appropriately. Handle exceptions yourself only if you have a better way to deal with them than just logging them.
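As a sketch, a servlet following this practice might look like this (the resource path and the render() method are made up):

```java
@Override
protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
        throws ServletException, IOException {
    Resource data = request.getResourceResolver().getResource("/content/myapp/data");
    if (data == null) {
        // a case we genuinely know how to handle better: answer with a clean 404
        response.sendError(HttpServletResponse.SC_NOT_FOUND);
        return;
    }
    // anything unexpected thrown by render() bubbles up to the Sling Main Servlet,
    // which stops the processing, sets the 500 status and keeps the exception
    // visible for platform monitoring
    render(data, response);
}
```

Note that there is no try/catch around render(): the servlet only handles the one error case it can actually improve on.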