Limits of dispatcher caching with AEM as a Cloud Service

In the last blog post I proposed 5 rules for Caching with AEM, how you should design your caching strategy. Today I want to show another aspect of rule 1: Prefer caching at the CDN over caching at the dispatcher.

I already explained that the CDN is always located closer to the consumer, so the latency is lower and the experience will be better. But when we limit the scope to AEM as a Cloud Service, the situation gets a bit complicated, because the dispatcher is not able to cache files for more than 24 hours.

This is caused by a few architectural decisions done for AEM as a Cloud Service:

These 2 decisions lead to the fact, that no dispatcher cache can hold files fore more than 24 hours because the instance is terminated after that time. And there are other situations where the publishs are to be re-created, for example during deployments and up/down-scaling situations, and then the cache does not contain files for 24 hours, but maybe just 3 hours.

This naturally can limit the cache-hit ratio in cases where you have content which is requested frequently but is not changed in days/weeks or even months. In an AEM as a Cloud Service setup these files are then rendered once per day (or more often, see above) per publish/dispatcher, while in other setups (for example AMS on on-prem setups where long-living dispatcher caches are pretty much default) it can delivered from the dispatcher cache without the need to re-render it every day.

The CDN does not have this limitation. It can hold for days and weeks and deliver them, if the TTL settings allow this. But as you can control the CDN only via TTL, you have to make a tradeoff between cache-hit ratio on the CDN and the accuracy of the delivered content regarding a potential change.

That means:

  • If you have files which do not change you just set a large TTL to them and then let the CDN handle them. A good example are clientlibs (JS and CSS files), because they have a unique name (an additional selector which is created as a hash over the content of the file.
  • If there’s a chance that you make changes to such content (mostly pages), you should set a reasonable TTL (and of course “stale-while-revalidate”) and accept that your publishs need to re-render these pages when the time has passed.

That’s a bit a drawback of the AEM as a Cloud Service setup, but on the hand side your dispatcher caches are regularly cleared.

The web, an eventually consistent system

For many large websites, CDNs are the foundation for delivering content quickly to their customers around the world. The ability of CDNs to cache responses close to consumers also allows these sites to operate on a small hardware footprint. However, compared to what they would have to invest if they operated without a CDN and delivered all content through their own systems, this comes at a cost: your CDN may now deliver content that is out of sync with your origin because you changed the content on your own system. This change is not done in an atomic fashion. This is the same “atomic” as in the ACID principle of database implementations.
This is a conscious decision, and it is caused primarily by the CAP theorem. It states that in a distributed data storage system, you can only achieve 2 of these 3 guarantees:

  • Consistency
  • Availability
  • Partition tolerance

And in the case of a CDN (which is a highly distributed data storage system), its developers usually opt for availability and partition tolerance over consistency. That is, they accept delivering content that is out of date because the originating system has already updated it.

To mitigate this situation the HTTP protocol has features built-in which help to mitigate the problem at least partially. Check out the latest RFC draft on it, it is a really good read. The main feature is called “TTL” (time-to-live) and means that the CDN delivers a version of the content only for a configured time. Afterwards the CDN fetches a new version will from the origin system. The technical term for this is “eventual consistent” because at that point the state of the system with respect to that content is consistent again.

This is the approach all CDNs support, and it works very reliable. But only if you accept that you change content on the origin system and that it will reach your consumers with this delay. The delay is usually set to a period of time that is empirically determined by the website operators, trying to balance the need to deliver fresh content (which requires a very low or no TTL) with the number of requests that the CDN can answer instead of the origin system (in this case, the TTL should be as high as possible). Usually it is in the range of a few minutes.

(Even if you don’t use a CDN for your origin systems, you need these caching instructions, otherwise browsers will make assumptions and cache the requested files on their own. Browsing the web without caching is slow, even on very fast connections. Not to mention what happens when using a mobile device over a slow 3G line … Eventual consistency is an issue you can’t avoid when working on the web.)

Caching is an issue you will always have to deal with when creating web presences. Try to cache as much as possible without neglecting the need to refresh or update content at a random time.

You need to constantly address eventual consistency. Atomic changes (that means changes are immediately available to all consumers) are possible, but they come at a price. You can’t use CDNs for this content; you must deliver it all directly from your origin system. In this case, you need to design your origin system so that it can function without eventual consistency at all (and that’s built in into many systems). Not to mention the additional load it will have to handle.

And for this reason I would always recommend not relying on atomic updates or consistency across your web presence. Always factor in eventual consistency in the delivery of your content. And in most cases even business requirements where “immediate updates” are required can be solved with a TTL of 1 minute. Still not “immediate”, but good enough in 99% of all cases. For the remaining 1% where consistency is mandatory (e.g. real-time stock trading) you need to find a different solution. And I am not sure if the web is always the right technology then.

And as an afterthought regarding TTL: Of course many CDNs offer you the chance to actively invalidate the content, but it often comes with a price. In many cases you can invalidate only single files. Often it is not an immediate action, but takes seconds up to many minutes. And the price is always that you have to have the capacity to handle the load when the CDN needs to refetch a larger chunk of content from your origin system.