Limits of dispatcher caching with AEM as a Cloud Service

In the last blog post I proposed 5 rules for caching with AEM and how you should design your caching strategy. Today I want to show another aspect of rule 1: prefer caching at the CDN over caching at the dispatcher.

I already explained that the CDN is always located closer to the consumer, so the latency is lower and the experience is better. But when we limit the scope to AEM as a Cloud Service, the situation gets a bit more complicated, because there the dispatcher is not able to cache files for more than 24 hours.

This is caused by 2 architectural decisions made for AEM as a Cloud Service:

  • Every dispatcher is paired with exactly one publish instance, and its cache lives on the local storage of that instance; there is no shared or persistent dispatcher cache.
  • These publish/dispatcher instances are short-lived: they are terminated and re-created at least every 24 hours.

These 2 decisions lead to the fact that no dispatcher cache can hold files for more than 24 hours, because the instance is terminated after that time. And there are other situations where the publishes are re-created, for example during deployments and up/down-scaling, and then the cache holds files not for 24 hours, but maybe just for 3 hours.

This naturally limits the cache-hit ratio in cases where you have content which is requested frequently but not changed for days, weeks or even months. In an AEM as a Cloud Service setup these files are rendered at least once per day (or more often, see above) per publish/dispatcher pair, while in other setups (for example AMS or on-prem setups, where long-living dispatcher caches are pretty much the default) they can be delivered from the dispatcher cache without the need to re-render them every day.

The CDN does not have this limitation. It can hold files for days and weeks and deliver them, if the TTL settings allow this. But as you can control the CDN only via TTLs, you have to make a tradeoff between the cache-hit ratio on the CDN and how quickly a content change reaches your visitors.

That means:

  • If you have files which do not change, set a large TTL on them and let the CDN handle them. A good example are clientlibs (JS and CSS files), because they have a unique name (an additional selector which is created as a hash over the content of the file).
  • If there’s a chance that you change such content (mostly pages), you should set a reasonable TTL (and of course “stale-while-revalidate”) and accept that your publish instances need to re-render these pages when the TTL has passed; see the sketch below.
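
A minimal sketch of such TTL settings as Apache configuration on the dispatcher tier (assuming mod_headers is enabled; the clientlib path and the concrete values are just examples, and the AEM as a Cloud Service CDN honors the Cache-Control headers it receives from the origin):

    # Clientlibs carry a content hash in their name, so they can be cached "forever".
    <LocationMatch "^/etc\.clientlibs/.*\.(js|css)$">
        Header set Cache-Control "public, max-age=31536000, immutable"
    </LocationMatch>

    # Pages may change, so use a moderate TTL plus stale-while-revalidate.
    <LocationMatch "\.html$">
        Header set Cache-Control "public, max-age=300, stale-while-revalidate=3600"
    </LocationMatch>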

That’s a bit of a drawback of the AEM as a Cloud Service setup, but on the other hand your dispatcher caches are regularly cleared.

Dispatcher, CDN and Caching

In today’s web performance discussions, there is a lot of focus on the browser as the most important factor. Google defines the Core Web Vitals, and there are many other aspects which are important for a fast site. Plus SEO on top of that …

While many developers focus on these, I see that many sites neglect the importance of proper caching. Although many of these sites already use a CDN (in AEM as a Cloud Service a CDN is part of every offering), they often do not use it in an optimal way; this can result in slow pages (because of network latency) and also unnecessary load on the backend systems.

In this blog post I want to outline some ways you can optimize your site for caching, with a focus on AEM in combination with a CDN. It does not really matter if it is AEM as a Cloud Service, AEM on AMS, or on-premises; these recommendations can be applied to all of them.

Rule 1: Prefer caching at the CDN over caching at the dispatcher

The dispatcher is typically co-located with your AEM instances, so there is a high latency from the dispatcher to the end-user, especially if your end-users are spread across the globe. For example, the average latency between Frankfurt/Germany and Sydney/Australia is approximately 250ms, and that alone makes browsing a website not really fast. Using a decent CDN can reduce these numbers dramatically.

Also, a CDN is better suited to handle millions of requests per minute than a bunch of VMs running dispatcher instances, both from a cost perspective and from the perspective of the know-how required to operate at that scale.

That means that your caching strategy should aim for optimal caching at the CDN level. The dispatcher is fine as a secondary cache to handle cache misses or expired cache items, but ideally no direct end-user request should ever make it through to the dispatcher.

Rule 2: Use TTL-based invalidation

The big advantage of the dispatcher is the direct control over the cache. You deliver your content from the cache until you change that content; immediately after the change the cache is actively invalidated, and your changed content is delivered. But you cannot use the same approach for CDNs, and while CDNs have made reasonable improvements to reduce the time needed to actively invalidate content, it still takes minutes.

A better approach is to use a TTL-based (time-to-live) invalidation (or rather: expiration), where every CDN node can decide on its own if a file in its cache is still valid or not. And if the content is too old, it gets refetched from the origin (your dispatchers).

Although this approach introduces some delay between the time of content activation and the time all users world-wide see the new version, such a delay is acceptable in general.
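
A minimal sketch of how such a TTL can be communicated, again as Apache configuration with mod_headers; Surrogate-Control is understood by several CDNs (for example Fastly, which also powers the AEM as a Cloud Service CDN) and lets you give the CDN a different TTL than the browsers:

    # Browsers re-check after 60 seconds, the CDN keeps the page for 5 minutes.
    <LocationMatch "\.html$">
        Header set Cache-Control "max-age=60"
        Header set Surrogate-Control "max-age=300"
    </LocationMatch>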

Rule 3: Staleness is not (necessarily) a problem

When you optimize your site, you should not only make sure that every request is served from the CDN (instead of from your dispatchers); you should also think about what happens when a requested file has expired on the CDN. Ideally it should not matter much.

Imagine that you have a file which is configured with a TTL of 300 seconds. What should happen if this file is requested 301 seconds after it has been stored in the CDN cache? Should the CDN still deliver it (and accept that the user receives a file which is a bit older than specified), or do you want the user to wait until the CDN has obtained a fresh copy of that file?
Typically you accept that staleness for a moment and deliver the old copy for a while, until the CDN has obtained a fresh copy in the background. Use the “stale-while-revalidate” Cache-Control directive to configure this behavior.
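
For the scenario above, the header could look like this (a sketch; stale-if-error from RFC 5861 is an optional addition which also covers origin outages):

    # 0-300s: served fresh from the CDN cache.
    # 300-900s: served stale immediately, while the CDN refetches the file in the background.
    # On origin errors a stale copy may be served for up to a day.
    Header set Cache-Control "max-age=300, stale-while-revalidate=600, stale-if-error=86400"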

Rule 4: Pay attention to the 404s

An HTTP status 404 (“Not Found”) is tricky to handle, because by default a 404 is not cached at the CDN. That means that all those requests will hit your dispatcher and eventually even your AEM instances, which are the authoritative source to answer whether such a file exists. But the number of requests an AEM instance can handle is much smaller than the number the dispatchers or even the CDN can handle, and you should reserve these precious resources for something more useful than responding with “sorry, the resource you requested is not here”.

For that reason check the 404s and handle them appropriately; you have a number of options for that:

  • Fix incorrect links which are under your control.
  • Create dispatcher rules or CDN settings which handle request patterns you don’t control, and return the 404 from there (see the sketch after this list).
  • You also have the option to allow the CDN to cache a 404 response.
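
A sketch of that second option as Apache rewrite rules on the dispatcher tier (assuming mod_rewrite is enabled; the patterns are just examples of requests an AEM site typically does not serve anyway):

    # Answer well-known scanner and bot requests directly with a 404,
    # without bothering the dispatcher cache or AEM at all.
    RewriteEngine On
    RewriteRule ^/(wp-login\.php|wp-admin|xmlrpc\.php) - [R=404,L]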

In any case, you should manage the 404s, because they are the most expensive type of request: you spend resources to deliver “nothing”.

Rule 5: Know your query strings

Query strings have long been used to provide parameters to the server-side rendering process, and you might use that approach in your AEM application as well. But query strings are also used a lot to tag campaign traffic for correct attribution; you have probably seen such requests already, as they often contain parameters like “utm_source”, “fbclid” etc. These parameters have no impact on the server-side rendering!
Because such requests cannot be cached by default, the CDN and the dispatcher will forward every request containing any query string to AEM. And AEM is again the most scarce resource, and rendering there will again impose the latency hit on your site visitors.

The dispatcher has the ability to ignore named query strings in a request, which enables it to serve such requests from the dispatcher cache; that’s not as good as serving them from the CDN, but much better than handling them on AEM. You should use that as much as possible.
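
In the dispatcher configuration this is done with the /ignoreUrlParams section of your farm; a minimal sketch (the parameter list is an example, adjust it to the tracking parameters you actually see in your traffic):

    /ignoreUrlParams
      {
      # By default every query parameter prevents caching ...
      /0001 { /glob "*" /type "deny" }
      # ... but pure tracking parameters are ignored, so these requests can be served from the cache.
      /0002 { /glob "utm_*" /type "allow" }
      /0003 { /glob "fbclid" /type "allow" }
      /0004 { /glob "gclid" /type "allow" }
      }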

If you follow these rules, you have the chance not only to improve the user experience for your visitors, but at the same time to make your site much more scalable and resilient against attacks and outages.

The web, an eventually consistent system

For many large websites, CDNs are the foundation for delivering content quickly to customers around the world. The ability of CDNs to cache responses close to the consumers also allows these sites to operate on a much smaller hardware footprint than they would need if they operated without a CDN and delivered all content through their own systems. However, this comes at a cost: your CDN may now deliver content that is out of sync with your origin, because you changed the content on your own system and that change does not propagate in an atomic fashion (the same “atomic” as in the ACID principle of database implementations).
This is a conscious decision, and it is explained primarily by the CAP theorem, which states that in a distributed data storage system you can only achieve 2 of these 3 guarantees:

  • Consistency
  • Availability
  • Partition tolerance

And in the case of a CDN (which is a highly distributed data storage system), its developers usually opt for availability and partition tolerance over consistency. That is, they accept delivering content that is out of date because the originating system has already updated it.

The HTTP protocol has built-in features which help to mitigate this problem at least partially; check out the latest RFC draft on HTTP caching, it is a really good read. The main concept is the “TTL” (time-to-live), which means that the CDN delivers a cached version of the content only for a configured time; afterwards the CDN fetches a new version from the origin system. The technical term for this is “eventually consistent”, because at that point the state of the system with respect to that content is consistent again.

This is the approach all CDNs support, and it works very reliably, but only if you accept that content changed on the origin system reaches your consumers with this delay. The delay is usually set to a period of time that is determined empirically by the website operators, balancing the need to deliver fresh content (which requires a very low or no TTL) against the number of requests that the CDN can answer instead of the origin system (for which the TTL should be as high as possible). Usually it is in the range of a few minutes.

(Even if you don’t use a CDN in front of your origin systems, you need these caching instructions, otherwise browsers will make their own assumptions and cache the requested files on their own. Browsing the web without caching is slow, even on very fast connections. Not to mention what happens when using a mobile device over a slow 3G line … Eventual consistency is an issue you can’t avoid when working on the web.)

Caching is an issue you will always have to deal with when creating web presences. Try to cache as much as possible without neglecting the need to refresh or update content at unpredictable times.

You need to constantly address eventual consistency. Atomic changes (meaning changes that are immediately visible to all consumers) are possible, but they come at a price. You can’t use CDNs for this content; you must deliver it all directly from your origin system. In this case you need to design your origin system so that it works without any eventual consistency at all (and that’s built into many systems). Not to mention the additional load it will have to handle.

And for this reason I would always recommend not to rely on atomic updates or strict consistency across your web presence. Always factor eventual consistency into the delivery of your content. In most cases even business requirements for “immediate updates” can be satisfied with a TTL of 1 minute: still not “immediate”, but good enough in 99% of all cases. For the remaining 1% where consistency is mandatory (e.g. real-time stock trading) you need to find a different solution, and I am not sure the web is always the right technology then.

And as an afterthought regarding TTLs: of course many CDNs offer you the chance to actively invalidate content, but it often comes at a price. In many cases you can invalidate only single files. Often it is not an immediate action, but takes seconds up to many minutes. And the price is always that you have to have the capacity to handle the load when the CDN needs to refetch a larger chunk of content from your origin system.

Permission sensitive caching

In recent versions of the dispatcher (starting with the 4.0.1 release) Day added a very interesting feature, which allows one to also cache content at the dispatcher level which is not public.
Honwai Wong of the Day support team explained it very well at the TechSummit 2008. I was a bit surprised, but I even found it on Slideshare (the first half of the presentation).
Honwai explains the benefits quite well. From my experience you can reduce the load on your CQ publishers (trading a request which requires the rendering of a whole page for a request which just checks the ACLs of a page).

If you want to use this feature, you have to make sure that for every group or user who gets an individual page, the dispatcher delivers the right one. Imagine you want to present the latest company news to logged-in users, while users who are not logged in shouldn’t see them, and only the managers get the link to the latest financial data on the startpage. So you need a startpage for 3 different groups (not-logged-in users, logged-in users, managers), and the system should deliver the appropriate one. Having a single home.html isn’t enough; you need to distinguish them.

The easiest way (and the Day way ;-)) is to use a selector denoting the group the user belongs to, so home.group-logged_in.html or home.managers.html would be good choices. If no selector is given, the user is assumed to be anonymous. You have to configure the linkchecker to rewrite all links so that they contain the correct selector. If a user belongs to the logged_in group and requests the home.logged_in.html page, the dispatcher will ask CQ: “the user has the following HTTP header lines and is requesting home.logged_in.html, is that ok?”. CQ then checks if the given HTTP header lines belong to a user of the group logged_in; because they do, it responds with “200 OK, just go on”. The dispatcher will then deliver the cached file, and there’s no need for CQ to render the same page again and again. If the user doesn’t belong to that group, CQ will detect that and send a “403 Permission denied”, and the dispatcher forwards this answer to the user. If a user is a member of more than one group, having multiple “group-” selectors is perfectly valid.
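
In later dispatcher versions this handshake is configured via the /auth_checker section of the farm; here is a rough sketch, where the servlet path /bin/permissioncheck is an assumption and you have to provide a servlet on the publish instance which answers the check with 200 or 403:

    /auth_checker
      {
      # Servlet on the publish instance which only checks whether the current user
      # may see the requested page (and answers with 200 or 403).
      /url "/bin/permissioncheck"
      # Only run the permission check for these cached resources.
      /filter
        {
        /0000 { /glob "*" /type "deny" }
        /0001 { /glob "*.html" /type "allow" }
        }
      }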

Please note: I speak of groups, not of (individual) users. I don’t think that this feature is useful when each user requires a personalized page. The cache-hit ratio would be pretty low (especially if you include often-changing content on it, e.g. daily news or the content of an RSS feed) and the disk consumption would be huge. If a single page is 20k and you have a version cached for 1000 users, you have a disk usage of 20 MB for a single page! And don’t forget the performance impact of a directory filled with thousands of files. If you want to personalize pages for individual users, caching is inappropriate. Of course the usual nasty hacks are applicable, like requesting the user-specific data via an AJAX call and then modifying the page in the browser using Javascript.

Another note: currently no documentation is available on permission-sensitive caching, only the above-linked presentation by Honwai Wong.

Creating cacheable content using selectors

The major difference between static objects and dynamically created objects is that the static ones can be stored in caches; the content they contain does not depend on user or login data, date or other parameters. They look the same on every request. So caching them is a good idea to move load off the origin system and to accelerate the request-response cycle.

A dynamically created object is influenced by certain parameters (usually username/login, permissions, date/time, but there are countless others) and therefore its content may differ from request to request. These parameters are usually specified as query parameters, and such responses must not be cached (see the HTTP 1.1 specification in RFC 2616).

But sometimes it would be great if we could combine these 2 approaches. For example, you want to offer images in 3 resolutions: small (as a preview image, e.g. in a folder view), big (full-screen view) and original (the full resolution delivered by the picture-taking device). If you decide to deliver them as static objects, they are cacheable, but you then need 3 names, one for each resolution. This blurs the fact that these 3 images are the same and differ only in their resolution; it creates 3 images instead of one image in 3 instances. A more practical drawback is that you always have to precompute these 3 pictures and place them in a reachable location, and lazy generation is hard.

If you choose the dynamic approach, the image is available as one object, and each instance can be created dynamically. The drawback here is that it cannot be cached.

Day Communique has a feature (which the Day folks also ported to Apache Sling) called selectors. They behave like the query parameters we have used since the stone age of the HTTP/HTML era, but they are not query parameters; they are encoded in the static part of the URL. So the query part of the URL (as of HTTP 1.1) is no longer needed.

So you can use the URLs /etc/medialib/trafficjam.preview.jpg, /etc/medialib/trafficjam.big.jpg and /etc/medialib/trafficjam.original.jpg to address the image in the 3 required resolutions. If your dispatcher doesn’t find them in its cache, it will forward the request to your CQ, which can then scale the requested image on demand. The dispatcher can then store the result and deliver it from its cache. That’s a very simple and efficient way to make dynamic objects static and offload requests from your application servers.
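
Since the publish instance will happily render whatever selector it is asked for, it is a good idea to admit only the expected selectors in the dispatcher filters, so arbitrary selectors cannot be used to fill your cache and your publish instances. A sketch for newer dispatcher versions which support the /selectors and /extension filter properties (path and selector names taken from the example above):

    /filter
      {
      /0001 { /type "deny" /glob "*" }
      # Only the three known renditions of the media library images are allowed.
      /0002 { /type "allow" /method "GET" /path "/etc/medialib/*" /selectors '(preview|big|original)' /extension '(jpg)' }
      }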

Caching the right way

I sometimes notice that there is some confusion about how content is transferred from a CQ system to the end user, mostly regarding caches, cache invalidation and content expiration.

We must distinguish between 2 separate mechanisms:

  1. Caching as in “Communique dispatcher cache”. As already described, the dispatcher cache is only invalidated when a replication agent triggers the invalidation. There is no mechanism which invalidates content after a certain amount of time.
  2. Caching as in “make use of the browser cache”. An RFC to the HTTP standard describes several mechanisms to specify the timeframe in which objects are valid. Here is a more informal introduction.

So these 2 mechanisms don’t collide; if you want to distribute your content effectively you should use both: the dispatcher cache to lower the load on your CQ systems, and the right HTTP headers to move traffic off your systems (and your internet connection) to downstream proxies and browser caches.

Some remarks on the right HTTP headers:

  • If you don’t send any HTTP caching headers, most proxies and browsers guess how long they consider an object “live” or “valid”. Do not rely on these guesses; control it yourself! Add the headers.
  • CQ doesn’t add any caching header by itself.
  • A very easy way to add HTTP caching headers is to configure your webserver to add them (for Apache, mod_expires is quite easy to use; see the sketch below). Then every time your webserver delivers an object through the dispatcher (either by fetching it from CQ or by retrieving it from the cache) it will add these headers.
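
A minimal sketch of such a mod_expires configuration (the concrete lifetimes are examples; pick them according to how often your content changes):

    ExpiresActive On
    # HTML pages may change with the next publication, so keep the lifetime short.
    ExpiresByType text/html "access plus 5 minutes"
    # Static resources change rarely and can live longer in downstream caches.
    ExpiresByType text/css "access plus 1 week"
    ExpiresByType application/javascript "access plus 1 week"
    ExpiresByType image/jpeg "access plus 1 month"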