3 rules for using an HttpClient in AEM

Many AEM applications consume data from other systems, and over the last decade the protocol of choice has turned out to be HTTP(S). There are a number of very mature HTTP clients out there which can be used together with AEM. The most frequently used one is the Apache HttpClient, which is shipped with AEM.

But although the HttpClient is quite easy to use, I have come across a number of problems, many of which resulted in service outages. In this post I want to list the 3 biggest mistakes you can make when using the Apache HttpClient. While I observed the consequences in AEM as a Cloud Service, the underlying mechanisms are the same on-prem and in AMS, even if the resulting effects can differ a bit.

Reuse the HttpClient instance

I often see that an HttpClient instance is created for a single HTTP request, and in many cases it is not even closed properly afterwards. This can lead to the following consequences:

  • If you don’t close the HttpClient instance properly, the underlying network connection(s) will not be closed either, but only time out eventually. Until then the network connections stay open. If you are using a proxy with a connection limit (many proxies have one), that proxy can start to reject new requests.
  • If you re-create an HttpClient for every request, the underlying network connection has to be re-established every time, adding the latency of the 3-way handshake.

The reuse of the HttpClient object and its state is also recommended by its documentation.

The best way to make that happen is to wrap the HttpClient into an OSGI service, create it on activation and stop it when the service is deactivated.
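
A minimal sketch of what such a service could look like with Apache HttpClient 4.x and OSGi annotations; the class name, the pool sizes and the accessor method are illustrative assumptions, not code from a real project:

    import java.io.IOException;

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.osgi.service.component.annotations.Activate;
    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Deactivate;

    @Component(service = BackendHttpService.class)
    public class BackendHttpService {

        private CloseableHttpClient httpClient;

        @Activate
        protected void activate() {
            // Create the client once; it maintains a reusable connection pool internally.
            httpClient = HttpClients.custom()
                    .setMaxConnTotal(20)
                    .setMaxConnPerRoute(20)
                    .build();
        }

        @Deactivate
        protected void deactivate() throws IOException {
            // Close the client and its pooled connections when the service goes away.
            if (httpClient != null) {
                httpClient.close();
            }
        }

        public CloseableHttpClient getHttpClient() {
            return httpClient;
        }
    }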

Set aggressive connection and read timeouts

Especially when an outbound HTTP request is executed within the context of an AEM request, performance really matters. Every millisecond spent in that external call makes the AEM request slower. This increases the risk of exhausting the Jetty thread pool, which then leads to non-availability of that instance, because it cannot accept any new requests. I have often seen AEM CS outages because a backend was responding slowly or not at all. All requests should finish quickly, and in case of errors they must also return fast.

That means timeouts should not exceed 2 seconds (personally I would prefer 1 second). If your backend cannot respond that fast, you should reconsider its fitness for interactive traffic and try not to call it synchronously within a request.
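
With Apache HttpClient 4.x such timeouts can be set via a RequestConfig. A small sketch, with the concrete values only illustrating the 1-2 second range mentioned above:

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class TimeoutConfiguration {

        // Build a client that fails fast instead of blocking a Jetty request thread.
        static CloseableHttpClient buildClient() {
            RequestConfig timeouts = RequestConfig.custom()
                    .setConnectTimeout(1000)           // max 1 second to establish the TCP connection
                    .setConnectionRequestTimeout(500)  // max 0.5 seconds to lease a connection from the pool
                    .setSocketTimeout(2000)            // max 2 seconds of socket inactivity while reading
                    .build();

            return HttpClients.custom()
                    .setDefaultRequestConfig(timeouts)
                    .build();
        }
    }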

Implement a degraded mode

When your backend application responds slowly, returns errors or is not available at all, your AEM application should react accordingly. I have seen it a number of times that any problem on the backend had an immediate effect on the AEM application, often resulting in downtime, because either the application was not able to handle the results of the HttpClient (so the response rendering failed with an exception), or because the Jetty thread pool was completely consumed by those requests.

Instead, your AEM application should be able to fall back into a degraded mode, which allows you to at least display a message that something is not working. In the best case the rest of the site continues to work as usual.
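
A sketch of what such a fallback could look like. The PriceService class, the backend URL and the Optional-based return value are hypothetical; they only illustrate the pattern of returning an empty result instead of throwing:

    import java.io.IOException;
    import java.util.Optional;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.util.EntityUtils;

    public class PriceService {

        private final CloseableHttpClient httpClient;

        public PriceService(CloseableHttpClient httpClient) {
            this.httpClient = httpClient;
        }

        // Returns the backend data, or Optional.empty() if the backend is slow or down.
        // The rendering code can then display a "currently not available" message
        // instead of failing with an exception.
        public Optional<String> fetchPrice(String productId) {
            HttpGet request = new HttpGet("https://backend.example.com/prices/" + productId);
            try (CloseableHttpResponse response = httpClient.execute(request)) {
                if (response.getStatusLine().getStatusCode() != 200) {
                    return Optional.empty();
                }
                return Optional.of(EntityUtils.toString(response.getEntity()));
            } catch (IOException e) {
                // Timeouts and connection errors end up here as well -> degraded mode.
                return Optional.empty();
            }
        }
    }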

If you implement these 3 rules for your backend connections, and especially if you test the degraded mode, your AEM application will be much more resilient when it comes to network or backend hiccups, resulting in fewer service outages. And isn’t that something we all want?

AEM CS & dedicated egress IP

Many customers of AEM as a Cloud Service are used to performing a first level of access control by allowing only a certain set of IP addresses to access a system. For that reason they want their AEM instances to use a static IP address or network range when accessing their backend systems. AEM CS supports this with a feature called “dedicated egress IP address”.

But when testing that feature, the feedback is often that it does not work, and that the incoming requests on the backend systems come from a different network range. This is expected, because the feature does not change the default routing of outgoing traffic for the AEM instances.

The documentation also says

Http or https traffic will go through a preconfigured proxy, provided they use standard Java system properties for proxy configurations.

The thing is: if traffic is supposed to use this dedicated egress IP, you have to explicitly make it use this proxy. This is important, because not all HTTP clients do this by default.

For example, in the Apache HttpClient library 4.x, the HttpClients.createDefault() method does not read the system properties related to proxying, but HttpClients.createSystem() does. The same applies to java.net.http.HttpClient, for which you need to configure the Builder to use a proxy. Also OkHttp requires you to configure the proxy explicitly.
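
A small sketch of both variants (the wrapper class is only for illustration):

    import java.net.ProxySelector;

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class ProxyAwareClients {

        // Apache HttpClient 4.x: createSystem() evaluates the standard proxy
        // system properties (http.proxyHost, http.proxyPort, ...), createDefault() does not.
        static CloseableHttpClient apacheClient() {
            return HttpClients.createSystem();
        }

        // java.net.http.HttpClient (Java 11+): configure the builder with a proxy selector;
        // ProxySelector.getDefault() evaluates the same standard system properties.
        static java.net.http.HttpClient jdkClient() {
            return java.net.http.HttpClient.newBuilder()
                    .proxy(ProxySelector.getDefault())
                    .build();
        }
    }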

So if requests from your AEM instance come from the wrong IP address, check that your code is actually using the configured proxy.

The web, an eventually consistent system

For many large websites, CDNs are the foundation for delivering content quickly to their customers around the world. The ability of CDNs to cache responses close to consumers also allows these sites to operate on a small hardware footprint, compared to what they would have to invest if they operated without a CDN and delivered all content through their own systems. However, this comes at a cost: your CDN may now deliver content that is out of sync with your origin, because after you change content on your own system, that change does not propagate in an atomic fashion. This is the same “atomic” as in the ACID principle of database implementations.
This is a conscious decision, and it is explained primarily by the CAP theorem. It states that in a distributed data storage system, you can only achieve 2 of these 3 guarantees:

  • Consistency
  • Availability
  • Partition tolerance

And in the case of a CDN (which is a highly distributed data storage system), its developers usually opt for availability and partition tolerance over consistency. That is, they accept delivering content that is out of date because the originating system has already updated it.

The HTTP protocol has features built in which help to mitigate this problem, at least partially. Check out the latest RFC draft on HTTP caching, it is a really good read. The main concept is a “TTL” (time-to-live), which means that the CDN delivers a cached version of the content only for a configured time. Afterwards the CDN fetches a new version from the origin system. The technical term for this is “eventually consistent”, because at that point the state of the system with respect to that content is consistent again.

This is the approach all CDNs support, and it works very reliably, but only if you accept that a content change on the origin system will reach your consumers with this delay. The delay is usually set to a period of time that is empirically determined by the website operators, balancing the need to deliver fresh content (which calls for a very low or no TTL) against the number of requests that the CDN can answer instead of the origin system (which calls for a TTL as high as possible). Usually it is in the range of a few minutes.
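
On the wire this TTL is usually expressed as a Cache-Control header set by the origin. A minimal servlet sketch, with the 300-second value only as an example from the “few minutes” range:

    import java.io.IOException;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CachedContentServlet extends HttpServlet {

        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException {
            // Allow the CDN and browsers to reuse this response for up to 5 minutes
            // before fetching a fresh version from the origin.
            response.setHeader("Cache-Control", "public, max-age=300");
            response.setContentType("text/html;charset=UTF-8");
            response.getWriter().write("<html><body>Hello, eventually consistent web!</body></html>");
        }
    }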

(Even if you don’t use a CDN for your origin systems, you need these caching instructions, otherwise browsers will make assumptions and cache the requested files on their own. Browsing the web without caching is slow, even on very fast connections. Not to mention what happens when using a mobile device over a slow 3G line … Eventual consistency is an issue you can’t avoid when working on the web.)

Caching is an issue you will always have to deal with when creating web presences. Try to cache as much as possible without neglecting the need to refresh or update content at any time.

You need to constantly address eventual consistency. Atomic changes (meaning changes that are immediately visible to all consumers) are possible, but they come at a price. You can’t use CDNs for this content; you must deliver all of it directly from your origin system. In this case, you need to design your origin system so that it can function without eventual consistency at all (and that’s built into many systems). Not to mention the additional load it will have to handle.

And for this reason I would always recommend not relying on atomic updates or consistency across your web presence. Always factor eventual consistency into the delivery of your content. In most cases even business requirements for “immediate updates” can be satisfied with a TTL of 1 minute. Still not “immediate”, but good enough in 99% of all cases. For the remaining 1% where consistency is mandatory (e.g. real-time stock trading) you need to find a different solution. And I am not sure if the web is always the right technology then.

And as an afterthought regarding TTL: of course many CDNs offer you the chance to actively invalidate content, but that often comes at a price. In many cases you can only invalidate single files. Often it is not an immediate action, but takes seconds up to many minutes. And the price is always that you have to have the capacity to handle the load when the CDN needs to re-fetch a larger chunk of content from your origin system.

The problems of multi-tenancy: the development model

In large enterprises an AEM project tends to attract many different interested parties, all of which love to make use of the features of AEM. They want to get onboard the platform as fast as they can. And this can be a real problem when it comes to such a multi-tenant AEM platform.

In the previous post I wrote about the governance problems with such projects and all the politics involved. These problems also persist in the daily business of developing and operating such platforms.

Many of these tenants already have development partners and agencies they are used to working with. These partners have experience in that specific area and know the business. So it’s quite likely that the tenants will continue to work with their partners in this project as well. And that is where the technical problems start.

Because at that point you’ll realize that you have multiple teams which rarely collaborate, or in the worst case not at all. Teams which might have different skill levels, operate in different development models and use different tooling. Each of these teams gets its own prioritization and has its own schedule, and in most cases the amount of communication between these teams is quite low.

So now the platform owner (or the development manager on their behalf) needs to set up a development model which allows these multiple teams to feed all their results into a single platform. A model which doesn’t slow down development agility and does not negatively impact the platform’s stability and performance. And this is quite hard.

A number of these challenges are (note: most of them are not specific to AEM at all!):

  • How can you ensure communication and collaboration between all development parties? That’s often a part which is left out (or forgotten) during time and budget estimation, and therefore the amount of time spent on it reflects this fact. But it’s the most important piece here.
  • On the other hand, how do you make sure that the overhead of communication and coordination is as low as possible? In most cases this means that each party gets its own version control system, its own Maven module and its own build jobs. This allows a better separation of concerns during development and build time, but just postpones the problem. Because …
  • How do you avoid the case that multiple parties use the same names, which have to be unique? For example, the same path below /apps or the same client library name? It’s hard to detect this at development time if you don’t have checks which cover multiple repositories and Maven modules.
  • Somewhat related: how do you handle dependencies on the same library but with different versions? Although OSGI supports this at runtime, AEM isn’t really prepared for a situation where you have a library deployed in both version 1 and version 2. So you need to centrally manage the list of 3rd party libraries (including version numbers) which the teams can use.
  • A huge challenge is testing. Once you have managed to deploy all delivered artifacts to a single instance (and combining these artifacts into deployable content packages often imposes its own set of problems), how do you test and where do you report issues? How does the triaging process work that assigns issues to the individual teams for fixing? This can very easily create a culture of blaming and denying, which makes the actual bug fixing very hard.
  • The same holds for production problems. No tenant and therefore no development team wants to get blamed for bringing down the platform because of some issue, so each problem can get very political, and teams start to argue why they are not responsible.
  • And many more…

These are real-world problems which hurt productivity.

My thoughts on how you can overcome (at least) some of these problems:

  • The platform owner should communicate openly with all tenants and involved development teams, and encourage them to adhere to a common development model.
  • The platform owner should provide clear rules for how each team is supposed to work and how they create and share their artifacts, as well as clear rules for coding and naming.
  • The platform owner should be in charge of a small team which supports all tenants and all development teams and helps to align requirements and the integration of the different codebases. This team is also responsible for all 3rd party library management and should have write access to the code repositories of all development teams.
  • Build and deployment are centralized as well.
  • Issue triaging is a cross-team effort.

This is all possible in a setup where the platform owner is not only responsible for running the platform, but is also allowed to exercise control over the deployment artifacts of the individual parties.

A side note: there is an architectural style called “microservices”, which seems to be gaining traction at the moment. It claims to address the “many teams working on a single platform” problem as well. But the whole idea is based on splitting a monolithic application into small self-contained services, which does not really apply to this multi-tenancy problem, where every tenant wants to customize some aspects of the common system for itself. If you apply this approach to the multi-tenancy problem here, you end up with a multi-platform architecture, where each tenant has its own version of the platform.

Connecting dispatchers and publishers

Today I want to cover a question which comes up every now and then (my gut feeling says this question appeared at least once every quarter for the last 5 years …):

How should I connect my dispatchers with the publish instances? 1:1, 1:n or m:n?

To give you an impression of what these scenarios could look like, I have sketched the 1:1 and the n:m scenario.

[Figure: publish-dispatcher-connections-1-1]
The 1:1 setup, where each dispatcher is connected to exactly one publish instance; for cache invalidation every publish instance is also connected only to its assigned dispatcher instance.

[Figure: publish-dispatcher-connections-N-M]
The n:m setup, where n dispatchers connect to m publish instances (illustrated here with n=3 and m=3); each dispatcher is connected via a load balancer to each publish instance, but each publish instance needs to invalidate all dispatcher caches.

I want to give you my personal opinion and answer to it. You might get other answers, both from Adobe consultants and from specialists outside of Adobe. They are all valuable insights into the question of how it’s done best in your case. Because it’s your case that matters.
My general answer to this question is: Use a 1:1 connection for these reasons:
  • it’s easy to debug
  • it’s easy to monitor
  • does not require any additional hardware or configuration
From a high-availability point of view this approach seems to have a huge drawback: when either the dispatcher or the publish instance fails, the other part is not usable anymore either.
Before we discuss this, let me state some facts which I consider basic and foundational to all my arguments here:
  • The dispatcher and the web server (I can only speak for Apache httpd and its derivatives, sorry IIS!) are incredibly stable. In the last 9 years I’ve set up and operated a good number of web environments and I’ve never seen a crashing web server or a crashing dispatcher module. As long as no one stops the process, this beast keeps handling requests.
  • A web server (and the dispatcher) is capable of delivering thousands of requests per second if the files originate from local disk and just need to be delivered. That’s at least 10 times the number any publish instance can handle.
  • If you look for the bottleneck in handling HTTP requests in your AEM architecture it’s always the publish application layer. Which is exactly the reason why there is a caching layer (the dispatcher) in front of it.
  • My assumption is that a web server on modern hardware (and operating systems) is able to deliver static files with a bandwidth of more than 500 Mbit per second (in a mixed-file scenario). So in most cases you reach the limit of your internet connection before you reach the limit of your web servers. Please note that this number is just a rough guess (depending on many other factors).
Based on these assumptions, let’s consider these scenarios in a 1:1 setup:
  • When the publish instance fails, the dispatcher instance isn’t fully operational anymore, as it cannot reach its renderer instance; so it’s best to take it out of the load balancing pool.
    Does this have any effect on the performance capabilities of your architecture? Of course it does: it reduces your ability to deliver static files from the dispatcher cache. That could be avoided if the dispatcher were connected to other publish instances as well. But as stated above, the delivery of static files isn’t a bottleneck at all, so taking out one web server has no visible effect.
  • When a web server/dispatcher fails, the connected publish instance is not reachable anymore either, effectively reducing the capacity of your bottleneck even further.
    Admittedly that’s true; but as stated above, I’ve rarely seen a crashed web server, so this case mostly occurs with hardware problems or massive misconfiguration.
So you have a measurable impact only in the case where the web server hardware goes down; in all other cases it’s not a problem for performance.
This is a small drawback, but from my point of view the benefits stated above outweigh it by far.
This is my standard answer when there’s no more specific information available. It’s a good rule of thumb. But if you have more specific requirements, it might make sense to change the 1:1 rule to a different one.
For example:
  • You plan to have 20 publish instances. Then it doesn’t make sense to have 20 webserver/dispatchers as well.
  • You want to serve a huge amount of static data (e.g. 100 TB of static assets), so your n copies of the same file get expensive in terms of disk space.
If you choose a different approach than the 1:1 scenario described in this blog post, please keep these factors in mind:
  • How do you plan to invalidate the dispatcher caches? Which publish instance will invalidate which dispatcher cache?
  • How do you plan to do maintenance of the publish instances?
  • What’s the effort to add or remove a publish instance? What needs to be changed?
Before you spend a lot of time and effort on building a complex dispatcher setup, please consider whether a CDN isn’t a more appropriate solution to your problem…