If you have curl, every problem looks like a request

If you are working in IT (or in a craft) you probably know the saying: “When you have a hammer, every problem looks like a nail”. It describes the tendency of people who have a tool which reliably solves a specific problem to apply that tool to every other problem as well, even if it does not fit at all.

Sometimes I see this pattern in AEM as well, just not with a hammer, but with “curl”. Curl is a command-line HTTP client, and it’s quite easy to fire a request against AEM and do something with its output. It’s a tool every AEM developer should be familiar with, also because it’s great for automating things. And if you talk about “automating AEM”, the first thing people often come up with is “curl”…

And there the problem starts: not every problem can be automated with curl. Take, for example, a periodic data export from AEM. The immediate reaction of most developers (forgive me if I generalize here, but I have seen this pattern too often!) is to write a servlet which pulls all this data together and creates a CSV list, and then use curl to request this servlet every day/week.

Works great, doesn’t it? Good, mark that task as done, next!

Wait a second, on prod it takes 2 minutes to create that list. Well, not a problem, right? Until it takes 20 minutes, because the number of assets is growing. And until you move to AEM CS, where the timeout for requests is 60 seconds, and your curl call is terminated with a status code 503.

So what is the problem? It is not the timeout of 60 seconds; and it’s also not the constantly increasing number of assets. It’s the fact that this is a batch operation, and you are using a communication pattern (request/response) which is not well suited for batch operations. It’s the fact that you start with curl in mind (a tool built for the request/response pattern) and therefore build the implementation around this pattern. You have curl, so every problem is solved with a request.

What are the limits of this request/response pattern? Definitely the runtime, and actually for 3 reasons:

  • The timeout for requests on AEM CS (or basically any other loadbalancer) is set for security reasons and to prevent misuse. Of course the limit of 60 seconds in AEM CS is a bit arbitrary, but personally I would not wait 60 seconds for a webpage to start rendering. So it’s as good as any higher number.
  • There is another limit, which is determined by the availability of the backend system which actually processes the request. In a highly available and autoscaling environment, systems start and stop in an automated fashion, managed by a control plane which operates on a set of rules. And these rules can enforce that any (AEM) system is forced to shut down at most 10 minutes after it has stopped receiving new requests. For a request which constantly takes 30+ minutes, that means it might be terminated without ever finishing successfully. And it’s unclear if your curl call would even realize it (especially if you are streaming the results).
  • (And technically you can add that the whole network connection needs to be kept open for that long, and AEM CS itself is just a single factor in there. Also, the internet is not always stable; you can experience network hiccups at any point in time. This is normally well hidden by retrying failing requests, which is not an option here, because it won’t solve the problem at all.)

In short: if your task can take a long time (say 60+ seconds), then a request is not necessarily the best way to implement it.

So, what options do you have then? Well, the following approach works also in AEM CS:

  1. Use a request to create and initiate your task (let’s call it a “job”);
  2. Then poll the system until this job is completed, and fetch the result.

This is an asynchronous pattern, and it’s much more scalable when it comes to the amount of processing you can do in there.

Of course you cannot use a single curl command anymore; now you need to write a program to execute this logic (don’t write it in a shell script, please!). But on the AEM side you can now use either Sling jobs or AEM workflows to perform the operation.

This avoids the restriction of 60 seconds, and it can handle restarts of AEM transparently, at least on the author side. And you have the huge benefit that you can collect all errors during the runtime of the job and decide afterwards if the execution was a success or a failure (which you cannot do in HTTP).
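As a minimal sketch of the AEM side (topic name, servlet path and class names are made up, and in a real project the two classes would be separate files), a servlet can queue a Sling job and return immediately, while a JobConsumer does the actual work in the background:

import java.io.IOException;
import javax.servlet.Servlet;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingAllMethodsServlet;
import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.JobManager;
import org.apache.sling.event.jobs.consumer.JobConsumer;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(service = Servlet.class, property = {
        "sling.servlet.paths=/bin/export/start",
        "sling.servlet.methods=POST" })
public class StartExportServlet extends SlingAllMethodsServlet {

    private static final String TOPIC = "com/example/export/csv";

    @Reference
    private JobManager jobManager;

    @Override
    protected void doPost(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws IOException {
        // queue the job and return immediately; the client polls for completion later
        Job job = jobManager.addJob(TOPIC, null);
        response.setContentType("application/json");
        response.getWriter().write("{\"jobId\":\"" + job.getId() + "\"}");
    }
}

@Component(service = JobConsumer.class, property = {
        JobConsumer.PROPERTY_TOPICS + "=com/example/export/csv" })
class CsvExportJobConsumer implements JobConsumer {

    @Override
    public JobResult process(Job job) {
        // the actual export logic; this may safely run for much longer than 60 seconds
        return JobResult.OK;
    }
}

The client-side program then polls a small status endpoint (backed for example by JobManager.getJobById()) until the job has finished, and only then downloads the result.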

So when you have long-running operations, check if you really need to do them within a request. In many cases it’s not required, and then please switch gears to an asynchronous pattern. And that’s something you can do even before the situation starts to become a problem.

Identify repository access

Performance tuning in AEM is typically a tough job. The most obvious and widely known aspect is the tuning of JCR queries, but that’s about it; if your code is not doing any JCR query and is still slow, it gets hard. For requests my standard approach is to use “Recent requests” and identify slow components, but that’s it. And then you have threaddumps, but these hardly help here. There is no standard way to diagnose further without relying on gut feeling and luck.

When I had to optimize a request last year, I thought again about this problem. And I asked myself the question:
Whenever I check this request in the threaddumps, I see the code accessing the repository. Why is this the case? Is the repository slow, or is the code just accessing the repository very frequently?

The available tools cannot answer this question, so I had to write something myself which can. In the end I contributed it to the Sling codebase with SLING-11654.

The result is an additional logger (“org.apache.sling.jcr.resource.AccessLogger.operation” on loglevel TRACE), which you can enable and which logs every single (Sling) repository access, including the operation, the path and the full stacktrace. That is a huge amount of data, but it answered my question quite thoroughly.

  • The repository itself is very fast, because a request (taking 500ms in my local setup) performs some 10,000 repository accesses. So the problem is rather the total number of repository accesses.
  • Looking at the list of accessed resources, it became very obvious that there is a huge number of redundant accesses. For example, these are the top 10 accessed paths while rendering a simple WKND page (/content/wknd/language-masters/en/adventures/beervana-portland):
    • 1017 /conf/wknd/settings/wcm/templates/adventure-page-template/structure
    • 263 /
    • 237 /conf/wknd/settings/wcm/templates
    • 237 /conf/wknd/settings/wcm
    • 227 /content
    • 204 /content/wknd/language-masters/en
    • 199 /content/wknd
    • 194 /content/wknd/language-masters/en/adventures/beervana-portland/jcr:content
    • 192 /content/wknd/jcr:content
    • 186 /conf/wknd/settings

With that logger I was able to identify access patterns and map them to code. Suddenly you see a much bigger picture, and you can spot a lot of redundant repository accesses.

With that help I identified the bottleneck in the code, and the posting “Sling Model performance” was the direct result of this finding. Another result was the topic for my talk at adaptTo() 2023; check out the recording for more numbers, details and learnings.

But with these experiences I made an important observation: you can use the number of repository accesses as a proxy metric for performance. The more repository accesses you do, the slower your application will get. So you don’t need to rely so much on performance tests anymore (although they definitely have their value); instead you can validate code changes by counting the number of repository accesses they perform. Fewer repository accesses are always more performant, no matter the environmental conditions.

And with an additional logger (“org.apache.sling.jcr.resource.AccessLogger.statistics” on TRACE) you can get just the raw numbers without details, so you can easily validate any improvement.

Equipped with that knowledge you should be able to investigate the performance of your application on your local machine. Looking forward to the results 🙂
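To enable these loggers on a local SDK instance, a Sling Commons Log factory configuration along these lines should do; the configuration name and log file are illustrative (for example org.apache.sling.commons.log.LogManager.factory.config~repoaccess.cfg.json):

{
  "org.apache.sling.commons.log.names": [ "org.apache.sling.jcr.resource.AccessLogger" ],
  "org.apache.sling.commons.log.level": "trace",
  "org.apache.sling.commons.log.file": "logs/repository-access.log"
}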

(This is currently only available on AEM CS / the AEM CS SDK; I will try to get it into an upcoming AEM 6.5 servicepack.)

The Explain Query tool

If there is one topic which has been challenging forever in the AEM world, it’s JCR queries and indexes. It can feel like an arcane science, where it’s quite easy to mess up and end up with a slow query. I learned it the hard way as well, and a printout of the JCR query cheatsheet is always below my keyboard.

But there have been some recent changes which make working on query performance easier. First, the Explain Query tool has been added to AEM CS, and it is also available via the AEM Developer Console. It displays queries, slow queries, the number of rows read, the used index, the execution plan etc. But even with that tool alone it’s still hard to understand what makes a query performant or slow.

Last week there was a larger update to the AEM documentation (thanks a lot, Tom!), which added a detailed explanation of the Explain Query tool. In particular it drills down into the details of the query execution plan and how to interpret it.

With this information and the good examples given there, you should be able to analyze the query plan of your queries and optimize the indexes and queries before you execute them for the first time in production.
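As a rough illustration (the exact plan output depends on your Oak version and index definitions): Oak understands an “explain” prefix in front of a query, which returns the execution plan instead of the results:

explain SELECT * FROM [cq:Page] AS p
WHERE ISDESCENDANTNODE(p, '/content/wknd')
ORDER BY p.[jcr:content/cq:lastModified] DESC

Roughly speaking, a plan referencing an index under /oak:index (for example the out-of-the-box cqPageLucene) means the query is index-backed, while a plan containing “traverse” means Oak falls back to traversing the repository — exactly the kind of query you want to catch before it reaches production.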

3 rules for using an HttpClient in AEM

Many AEM applications consume data from other systems, and in the last decade the protocol of choice has turned out to be HTTP(S). There are a number of very mature HTTP clients out there which can be used together with AEM. The most frequently used variant is the Apache HttpClient, which is shipped with AEM.

But although the HttpClient is quite easy to use, I have come across a number of problems, many of which resulted in service outages. In this post I want to list the 3 biggest mistakes you can make when you use the Apache HttpClient. While I observed the results in AEM as a Cloud Service, the underlying causes are the same on-prem and in AMS, even if the resulting symptoms can be a bit different.

Reuse the HttpClient instance

I often see that an HttpClient instance is created for a single HTTP request, and in many cases it’s not even closed properly afterwards. This can have these consequences:

  • If you don’t close the HttpClient instance properly, the underlying network connection(s) will not be closed properly either, but will eventually time out. Until then the network connections stay open. If you are using a proxy with a connection limit (many proxies have one), this proxy can reject new requests.
  • If you re-create an HttpClient for every request, the underlying network connection gets re-established every time, with the latency of the 3-way handshake.

The reuse of the HttpClient object and its state is also recommended by its documentation.

The best way to make that happen is to wrap the HttpClient into an OSGi service: create it on activation and close it when the service is deactivated.
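A minimal sketch of such a service, assuming HttpClient 4.x as shipped with AEM (the class name and the concrete limits are illustrative; note that it already applies the aggressive timeouts discussed in the next rule):

import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Deactivate;

@Component(service = BackendClient.class)
public class BackendClient {

    private CloseableHttpClient httpClient;

    @Activate
    protected void activate() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)           // connection timeout in ms
                .setSocketTimeout(1000)            // read timeout in ms
                .setConnectionRequestTimeout(1000) // wait time for a pooled connection
                .build();
        httpClient = HttpClients.custom()
                .useSystemProperties()             // honor proxy settings, relevant on AEM CS
                .setMaxConnTotal(20)
                .setDefaultRequestConfig(config)
                .evictIdleConnections(30, TimeUnit.SECONDS)
                .build();
    }

    @Deactivate
    protected void deactivate() throws IOException {
        httpClient.close();  // releases all pooled connections cleanly
    }

    public CloseableHttpClient getHttpClient() {
        return httpClient;
    }
}

Consumers reference this service and share the pooled connections, instead of building (and forgetting to close) their own client per request.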

Set aggressive connection- and read-timeouts

Especially when an outbound HTTP request is executed within the context of an AEM request, performance really matters. Every millisecond spent in that external call makes the AEM request slower. This increases the risk of exhausting the Jetty thread pool, which then leads to non-availability of that instance, because it cannot accept any new requests. I have often seen AEM CS outages because a backend was responding slowly or not at all. All requests should finish quickly, and in case of errors they must also return fast.

That means timeouts should not exceed 2 seconds (personally I would prefer even 1 second). And if your backend cannot respond that fast, you should reconsider its fitness for interactive traffic, and try not to connect to it in a synchronous request.

Implement a degraded mode

When your backend application responds slowly, returns errors or is not available at all, your AEM application should react accordingly. I have seen it a number of times that any problem on the backend had an immediate effect on the AEM application, often resulting in downtimes, because either the application was not able to handle the results of the HttpClient (so the response rendering failed with an exception), or because the Jetty threadpool was totally consumed by those requests.

Instead, your AEM application should be able to fall back into a degraded mode, which allows you to at least display a message that something is not working. In the best case the rest of the site continues to work as usual.
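A hedged sketch of such a fallback, reusing the BackendClient service from the sketch above (the endpoint URL and method name are made up); whenever the Optional comes back empty, the component renders its degraded markup instead:

import java.io.IOException;
import java.util.Optional;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

public Optional<String> fetchTeaserData() {
    HttpGet get = new HttpGet("https://backend.example.com/api/teaser"); // hypothetical backend
    try (CloseableHttpResponse response = backendClient.getHttpClient().execute(get)) {
        if (response.getStatusLine().getStatusCode() == 200) {
            return Optional.of(EntityUtils.toString(response.getEntity()));
        }
        // non-200 responses are treated as failures, too
    } catch (IOException e) {
        // timeouts and connection errors end up here; log them, but never let
        // them propagate into the response rendering
    }
    return Optional.empty();
}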

If you implement these 3 rules for your backend connections, and especially if you test the degraded mode, your AEM application will be much more resilient when it comes to network or backend hiccups, resulting in fewer service outages. And isn’t that something we all want?

Recap: adaptTo() 2023

It was adaptTo() time again, for the first time in an in-person format since 2019. And it’s definitely much different from the virtual formats we experienced during the pandemic. More personal, and allowing me to get away from the daily work routine; I remember that in 2020 and 2021 I constantly had work-related topics (mostly Slack) on the other screen while I was attending the virtual conference. That’s definitely different when you are at the venue 🙂

And it was great to see all the people again. Many of them have been part of the community for years, but there were also many new faces. Nice to see that the community can still attract new people, although I think that the golden time of backend-heavy web development is over. And that was reflected on stage as well, with Edge Delivery Services being quite a topic.

As in past years, the conference itself isn’t that large (this year maybe 200 attendees), and it gives you plenty of chances to get in touch and chat about projects, new features, bugs and everything else you can imagine. The location is nice, and Berlin gives you plenty of opportunities to go out for dinner. So while 3 days of conference can definitely be exhausting, I would have liked to spend many more dinners with attendees.

I got the chance to come on stage again with one of my favorite topics: performance improvement in AEM, a classic backend topic. According to the talk feedback, people liked it 🙂
Also, the adaptTo() folks recorded all the talks, and you can find both the recording and the slide deck on the talk’s page.

The next call for papers is already announced to start in February ’24, and I will definitely submit a talk again. Maybe you as well?

AEM CS & dedicated egress IP

Many customers of AEM as a Cloud Service are used to performing a first level of access control by allowing just a certain set of IP addresses to access a system. For that reason they want their AEM instances to use a static IP address or network range to access their backend systems. AEM CS supports this with a feature called “dedicated egress IP address”.

But when testing that feature, the feedback often is that it does not work, and that the incoming requests on the backend systems come from a different network range. This is expected, because this feature does not change the default routing for outgoing traffic of the AEM instances.

The documentation also says:

Http or https traffic will go through a preconfigured proxy, provided they use standard Java system properties for proxy configurations.

The thing is: if traffic is supposed to use this dedicated egress IP, you have to explicitly make it use this proxy. This is important, because by default not all HTTP clients do this.

For example, in the Apache HttpClient 4.x library the HttpClients.createDefault() method does not read the system properties related to proxying, but HttpClients.createSystem() does. The same holds for java.net.http.HttpClient, for which you need to configure the Builder to use a proxy. Also, okhttp requires you to configure the proxy explicitly.
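A short sketch of the difference (assuming the standard http.proxyHost/http.proxyPort system properties are set, as they are on AEM CS):

import java.net.ProxySelector;
import java.net.http.HttpClient;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Apache HttpClient 4.x: createDefault() ignores the proxy system properties...
CloseableHttpClient noProxy = HttpClients.createDefault();
// ...while createSystem() (or the builder with useSystemProperties()) honors them
CloseableHttpClient viaProxy = HttpClients.createSystem();

// java.net.http.HttpClient (Java 11+) must be told explicitly;
// the default ProxySelector consults the standard proxy system properties
HttpClient jdkClient = HttpClient.newBuilder()
        .proxy(ProxySelector.getDefault())
        .build();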

So if requests from your AEM instance come from the wrong IP address, check that your code is actually using the configured proxy.

Sling Model Exporter: What is exported into the JSON?

Last week we came across a strange phenomenon, when the AEM release validation process broke in an unexpected situation. Which is indeed a good thing, because it uncovered an aspect I had never thought of.

The validation broke because during a request the serialization of a Sling Model failed with an exception. The short version: it tried to serialize a ResourceResolver(!) into JSON (more details in SLING-11924). Why would anyone serialize a ResourceResolver into JSON to be consumed by an SPA? I firmly believe that this was not done intentionally, but happened by accident. Nevertheless, it broke the improvement we intended to make, so we had to roll it back and wait for SLING-11924 to be implemented.

But it gives me the opportunity to explain which fields of a Sling Model are exported by the Sling Model Exporter. As it is backed by the Jackson databind framework, the same rules apply:

  • All public fields are serialized.
  • All publicly accessible getter methods which do not expect a parameter are serialized.

It is not too hard to check this, but there are a few subtle aspects to consider in the context of Sling Models.

  • Injections: only make those injections public which you want to be handled by the Sling Model Exporter. Make everything else private.
  • I often see Lombok used to create getters for Sling Models (because you need them for use in HTL). This is especially problematic when the @Getter annotation is placed on class level, because then a getter is created for every field (no matter its visibility), which is then picked up by the Sling Model Exporter (see the sketch below).
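A hedged sketch of these rules in action (resource type and class name are made up):

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.models.annotations.DefaultInjectionStrategy;
import org.apache.sling.models.annotations.Exporter;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.SlingObject;
import org.apache.sling.models.annotations.injectorspecific.ValueMapValue;

@Model(adaptables = Resource.class,
       resourceType = "myproject/components/teaser",
       defaultInjectionStrategy = DefaultInjectionStrategy.OPTIONAL)
@Exporter(name = "jackson", extensions = "json")
public class TeaserModel {

    // private and without a public getter: stays out of the JSON
    @SlingObject
    private ResourceResolver resourceResolver;

    // private field, but the public no-arg getter below exposes it
    @ValueMapValue
    private String title;

    public String getTitle() {
        return title;
    }

    // a class-level Lombok @Getter would also generate a getter for
    // resourceResolver, and suddenly it ends up in the JSON
}

And if you need a public getter for HTL but want to keep the value out of the JSON, Jackson’s @JsonIgnore on that getter does the trick.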

My call to action: validate your Sling Models and check that you don’t export a ResourceResolver by accident. (If you are an AEM as a Cloud Service customer and affected by this problem, you will probably get an email from us telling you to do exactly that.)

Sling Models performance (part 3)

In the first and second part of this series “Sling Models performance” I covered aspects which can degrade the performance of your Sling Models, be it by not specifying the correct injector or by re-using complex models (with expensive @PostConstruct methods) for very simple cases.

And there is another aspect when it comes to performance degradation, and it starts with a very cool convenience feature: Sling Models can create a whole tree of objects. Imagine this code as part of a Sling Model:

@ChildResource
AnotherModel child;

It will adapt the child resource named “child” to the class AnotherModel and inject it. This nesting is a cool feature and can be a time-saver if you have a more complex resource structure to model your content.

But it also comes at a price, because it creates another Sling Model object; and even that Sling Model can trigger the creation of more Sling Models, and so on. As I have outlined in my previous posts, the creation of these Sling Models does not come for free. So if your “main Sling Model” internally creates a whole tree of Sling Models, the required time will increase. This can be justified, but not if you just need a fraction of the data of these Sling Models. Is it worth spending 10 milliseconds to create a complex Sling Model just to call a simple getter on it, if you could retrieve this information alone in just 10 microseconds?

So this is a situation where I need to repeat what I have already written in part 2:


When you build your Sling Models, try to resolve all data lazily, when it is requested the first time.

Sling Models performance (part 2)

But unfortunately, injectors do not work lazily but eagerly; injections are executed as part of the construction of the model. Having a lazy injection would be a cool feature …

So until this is available, you should check the re-use of Sling Models quite carefully; always consider how much work is actually done in the background, and whether the value of reusing that Sling Model is worth the time spent in rendering.
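One way to keep the convenience but defer the cost is to inject the plain Resource and adapt it only on first access; a minimal sketch (the class names are made up):

import org.apache.sling.api.resource.Resource;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.ChildResource;

@Model(adaptables = Resource.class)
public class ParentModel {

    // injecting the Resource itself is cheap: no nested model is created yet
    @ChildResource
    private Resource child;

    private AnotherModel childModel;

    // the expensive adaptation happens only when a consumer really needs it
    public AnotherModel getChild() {
        if (childModel == null && child != null) {
            childModel = child.adaptTo(AnotherModel.class);
        }
        return childModel;
    }
}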

The most expensive HTTP request

TL;DR: When you do a performance test for your application, also test a situation where you just fire a large number of invalid requests, because you need to know if your error handling is good enough to withstand this often unplanned load.

In my opinion the most expensive HTTP requests are the ones which return a 404. They don’t bring any value, are not as easily cacheable as others, and are very easy to generate. If you look into AEM logs, you will often find random parties firing a lot of requests, obviously trying to find vulnerable software. In AEM these always fail with a status code 404, because there are no resources with these names. But this becomes a problem if these 404 pages are complex to render, taking 1 second or more. In that case requesting 1000 non-existing URLs can turn into a denial of service.

This can get even more complex if you work with suffixes and the end user can just request the suffix, because you prepend the actual resource via mod_rewrite on the dispatcher. In such situations the requested resource is present (the page you configured), but the suffix can be invalid (for example, pointing to a non-existing resource). Depending on the implementation you find out about this very late; by then you have already rendered a major part of the page, just to learn that the suffix is invalid. This can also lead to a denial of service, but it is much harder to mitigate than the plain 404 case.
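A hedged sketch of failing fast in such a case, inside a hypothetical suffix-driven servlet:

import java.io.IOException;
import javax.servlet.http.HttpServletResponse;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.request.RequestPathInfo;

@Override
protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
        throws IOException {
    RequestPathInfo pathInfo = request.getRequestPathInfo();
    String suffix = pathInfo.getSuffix();
    // validate the suffix before any expensive rendering work starts
    if (suffix == null || request.getResourceResolver().getResource(suffix) == null) {
        response.sendError(HttpServletResponse.SC_NOT_FOUND);
        return;
    }
    // ... regular, potentially expensive rendering ...
}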

So what’s the best way to handle such situations? You should test for them explicitly. Build a simple performance test which just fires a few hundred requests triggering a 404, and observe the response time of the regular requests; it should not degrade! If you need to simplify your 404 pages to achieve that, then do it! Many popular websites have very stripped-down 404 pages for just that reason.

And when you design your URLs, you should always keep in mind these robots which just show up with (more or less) random strings.

AEM article review December 2022

I have been doing this blog for quite some time now (the first article dates back to December 2008! That was the time of CQ 5.0! OMG), and of course I am not the only one writing on AEM. Actually, the number of articles produced every month is quite large, but I am often a bit disappointed, because many just reproduce some very basic aspects of AEM which can be found in many places. The amount of new content describing aspects which have barely been covered by other blog posts or the official product documentation is small.

For myself, I try to focus on such topics, offer unique views on the product and provide recommendations how things can be done (better), all based on my personal experiences. I think that this type of content is appreciated by the community, and I get good feedback on it. To encourage the broader community to come up with more content covering new aspects, I will do a little experiment and promote a few selected articles of others. I think that these articles show new aspects or offer a unique view on certain areas of AEM.

Depending on the feedback I will decide if I will continue with this experiment. If you think that your content also offers new views, uncovers hidden features or suggests best practices, please let me know (see my contact data here). I will judge these proposals on the above-mentioned criteria, but of course it will still be my personal decision.

Let’s start with Theo Pendle, who has written an article on how to write your own custom injector for Sling Models. The example he uses is a really good one, and he walks you through all the steps and explains very well why all of it is necessary. I like the general approach of Theo’s writing, and I consider safely injecting cookie values a valid case for such an injector. But in general I think there are not many other cases out there where it makes sense to write custom injectors.

Also on a technical level, John Mitchell has published his article “Using Sling Feature Flags to Manage Continous Releases” on the Adobe Tech Blog. He introduces Sling Features and how you can use them to implement feature flags. That’s something I have not yet seen used in the wild, and the documentation is quite sparse on it. But he gives a good starting point, although a more practical example would be great 🙂

The third article I like the most: Kevin Nenning writes on “CRXDE Lite, the plague of AEM“. He outlines why CRXDE Lite has gained such a bad reputation within Adobe that disabling CRXDE Lite has been part of the go-live checklist for quite some time. But on the other hand he loves the tool, because it’s a great way for quick hacks on your local development instance and a good general read-only tool. This is an article every AEM developer should read.
And in case you haven’t seen it yet: AEM as a Cloud Service offers the repository browser in the Developer Console for a read-only view on your repo!

And finally there is Yuri Simione (an Adobe AEM champion), who published 2 articles discussing the question “Is AEM a valid Content Services Plattform?” (article 1, article 2). He discusses an implementation based on Jackrabbit/Oak and Sling (but not AEM) to replace an aging Documentum system, and he offers an interesting perspective on the future of Jackrabbit. Definitely worth a read if you are interested in a broader use of AEM and its foundational pieces.

That’s it for December. I hope you enjoy these articles as much as I did, and that you can learn from them and get some new inspiration and insights.