3 rules how to use an HttpClient in AEM

Many AEM applications consume data from other systems, and in the last decade the protocol of choice turned out to the HTTP(S). And there are a number of very mature HTTP clients out, which can be used together with AEM. The most frequently used variant is the Apache HttpClient, which is shipped with AEM.

But although the HttpClient is quite easy to use, I came across a number of problems, many of them result in service outages. In this post I want to list the 3 biggest mistakes you can make when you use the Apache HttpClient. While I observed the results in AEM as a Cloud Service, the underlying effects are the same on-prem and in AMS, the resulting effects can be a bit different.

Reuse the HttpClient instance

I often see that a HttpClient instance is created for a single HTTP request, and in many cases it’s not even closed properly afterwards. This can lead to these consequences:

If you don’t close the HttpClient instance properly, the underlying network connection(s) will not be closed properly, but eventually timeout. And until then the network connections stays open. If you using a proxy with a connection limit (many proxies do that) this proxy can reject new requests.
If you re-create a HttpClient for every request, the underlying network connection will get re-established every time with the latency of the 3-way handshake.

The reuse of the HttpClient object and its state is also recommended by its documentation.

The best way to make that happen is to wrap the HttpClient into an OSGI service, create it on activation and stop it when the service is deactivated.

Set agressive connection- and read-timeouts

Especially when an outbund HTTP request should be executed within the context of a AEM request, performance really matters. Every milisecond which is spent in that external call makes the AEM request slower. This increases the risk of exhausting the Jetty thread pool, which then leads to non-availability of that instance, because it cannot accept any new requests. I have often seen AEM CS outages because a backend was not responding slowly or not at all. All requests should finish quickly, and in case of errors must also return fast.

That means, timeouts should not exceed 2 second (personally I would prefer even 1 second). And if your backend cannot respond that fast, you should reconsider its fitness for interactive traffic, and try not to connect to it in a synchronous request.

Implement a degraded mode

When your backend application responds slowly, returns errors or is not available at all, your AEM application should be react accordingly. I had the case a number of times that any problem on the backend had an immediate effect on the AEM application, often resulting in downtimes because either the application was not able to handle the results of the HttpClient (so the response rendering failed with an exception), or because the Jetty threadpool was totally consumed by those requests.

Instead your AEM application should be able to fallback into a degraded mode, which allows you to display at least a message, that something is not working. In the best case the rest of the site continues to work as usual.

If you implement these 3 rules when you do your backend connections, and especially if you test the degraded mode, your AEM application will be much more resilient when it comes to network or backend hiccups, resulting in less service outages. And isn’t that something we all want?

What’s the maximum size of a node in JCR/AEM?

An interesting question which comes up every now and then is: “Is there a limit how large a JCR node can get?”. And as always in IT, the answer is not that simple.

In this post I will answer that question and also outline why this limit is hardly a constraint in AEM development. Also I will show ways how you can design your application so that this limit is not a problem at all.

Continue reading →

Sling Scheduled Jobs vs Sling Scheduler

Apache Sling and AEM provide 2 different approaches to start processes at a given time or in a given interval. It is not always trivial to make the right decision between these two, and I have seen a few cases of misuse already. Let’s dive into this topic and I will outline in what situation to use the Scheduler and when to use Scheduled Jobs.

Continue reading →

How to properly delete a page

A relevant aspect of any piece of content is the livecycle, the process of creation, modification, using and finally deletion of that content. And although the deletion of any page in AEM sounds quite easy, there are quite a few aspects which need to be dealt with. For example:

Create of a version of the page, so it can be restored.
Update the MSM structures (if required)
De-activate the page from publishing.
Create an entry in the audit.log

All this happens when you use one of the pagemanager.delete() function to remove the page. If you are not using it, the most obvious problem you’ll face afterwards is the fact, that you have published pages which you cannot delete anymore (because the page is missing on authoring), and you have to use a workaround for it.

So, please remember: The pagemanager might have overhead in many areas, but there is a reason for it to exist. Taking care of all these mentioned activities is one of it. So whenever you deal with pages (creating/moving/renaming/deleting), first check the pagemanager API before you start using the JCR or Sling API.

Prevent workflow launchers from starting a workflow

Workflow launchers are the standard way to trigger workflows based on changes in the content respository. The most prominent workflow which is triggered that way is the “Asset Update Workflow”, which does all the heavy lifting regarding asset processing. And it’s important to note that this workflow is executed on all changes to an asset itself, its renditions or on metadata.

But often this is not required. If you add more or custom meta data to an asset or even do it in a batch mode, you don’t want to this workflow to run at all; these metatadate changes are not relevant to assets themselves, but just to the way they should be handled in the specific context of your application.

The typical way to make the workflow not to start is to disable the workflow launcher (setting the “enabled” flag to “false”). But this is a global setting which affects all possible invocations, that means also the regular ingestion; and in that case the workflow has to run. So you need a way to specifically disable the workflow to start.

Fortunately there are a few ways how to achieve that, if you have the code under control, which performs the changes, and after which you don’t want the workflow to start again. This is key, because there is a feature available in the workflow launcher (sidenote: I just found that it has been documented; so it often makes sense to check documentation if there have been updates).

You can configure on the workflow launcher an exclusion property in the format “event-user-data:randomString”; this ignores all changes made by a JCR session which has a user-property “randomString” set.

How can you set that property? That’s quite easy:

Session session = ...; session.getWorkspace().getObservationManager().setUserData("randomString"); // do you work with the session session.save();

And by default the “Asset Update Workflow” is configured with “event-user-data:changedByWorkflowProcess”, so if your batch asset-operation sets the user-data to this string “changedByWorkflowProcess”, the “Asset Update Workflow” is not triggered anymore, without disabling the workflow launcher for it.

That’s it. And if you ever wanted to channel data from a saving session to the process which handles the observation events for it (the workflow launchers are just a very convenient way around the JCR Observation API): Just use event.getUserData().

AEM anti-pattern: The hardcoded content structure

One the first things I usually do when we start an AEM project is to get a clear vision of the content and its structure. We normally draw a number of graphs, discuss a number of use cases, and in the end we come up with a content structure, which satisfies the requirements. Then we implement this structure as a hierarchy of nodes and that’s it.

In many cases developers start to use this structure without too much thinking. They assume, that the node structure is always like this. They even start to hardcode paths and language names or mimic this structure. Sometimes that’s not a problem. But it is getting hard, when you are building a multi-language or multi-tenant site and you start simple with only 1 language and 1 tenant; then you might end up with these languages or tenants being hardcoded, as “there was no time to make it right”. Imagine when you start with the second language or the second site and someone hardcoded a language or a site name/path.

So, what can you do to avoid hardcoded paths? Some information is always stored at certain areas. For example you can store basic contact information on the root node, which you can reuse on the whole site. So how do you identify the correct root node if you have multiple sites? Or how do you identify the language of the site?

The easiest way is to mark these site root pages (I prefer pages here over nodes, as they can be created using the authoring UI and are much more easier authorable) with a certain property and value. The easiest way is then if you have a special template with its dedicated resource type. Then you can identify these root pages using 2 approaches:

When you need to find them all, use a JCR query and look for all pages with this specific resource type.
When you need to find the siteroot page for a given page (or resource), just iterate up the hierarchy until you find a page with this resource type.

This mechanism allows you to be very flexible in terms of the content hierarchy. You no longer depend on pages being on a certain level or having special names. It’s all dynamic and you don’t have any dependency on the content structure. This page doesn’t even have to be the root-page of the public facing site, but is just a configuration page used for administration and configuration purposes. The real root-page can be a child or grand-child of it. You have lot’s of choices then.

But wait, there is a single limitation: Every site must have a sitters page using this special template/resourcetype. But that isn’t a hard restriction, isn’t it?

And remember: Never do string operations on a content path to determine something, neither the language nor the site name. Never.