My view on manual cache flushing

I read the following statement by Samuel Fawaz on LinkedIn regarding the recent announcement of the self-service feature to get the API key for CDN purge for AEM as a Cloud Service:

[…] Sometimes the CDN cache is just messed up and you want to clean out everything. Now you can.

I fully agree that self-service for this feature was overdue. But I always wonder why an explicit cache flush (for both CDN and dispatcher) is necessary at all.

The caching rules are very simple, as the rules for the AEM as a Cloud Service CDN are all based on the TTL (time-to-live) information sent from AEM or the dispatcher configuration. The caching rules for the dispatcher are equally simple and should be well understood (I find that this blog post on the TechRevel blog covers this topic of dispatcher cache flushing quite well).

In my opinion it should be doable to build a model that lets you predict how long it takes for a page update to become visible to all users on the CDN. Such a model also lets you reason about more complex situations (especially when content is pulled from multiple pages/areas to render) and understand how and when content changes become visible to end users.
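To illustrate what such a model could look like, here is a minimal sketch in Python. It is not an official AEM tool; the names `CacheLayer` and `worst_case_delay` and all TTL values are made up for illustration. It assumes that every cache layer refills from the layer behind it and then restarts its own TTL clock, and that a layer invalidated on activation only helps if everything behind it is already fresh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheLayer:
    """One cache between AEM publish and the end user (hypothetical model)."""
    name: str
    ttl_seconds: int                      # how long this layer keeps a response
    invalidated_on_publish: bool = False  # e.g. a dispatcher flushed on activation


def worst_case_delay(layers):
    """Upper bound in seconds until all users see a newly published change.

    `layers` is ordered from the origin side (dispatcher) to the user side.
    Assumption: a layer that refills from a stale layer behind it keeps the
    stale copy for its full TTL, so the delays add up.
    """
    stale_for = 0
    for layer in layers:
        if layer.invalidated_on_publish and stale_for == 0:
            # Flushed on publish and everything behind it is fresh:
            # this layer serves the new content right away.
            continue
        stale_for += layer.ttl_seconds
    return stale_for


# Hypothetical setup: dispatcher flushed on activation, TTL-based CDN and browser.
layers = [
    CacheLayer("dispatcher", ttl_seconds=0, invalidated_on_publish=True),
    CacheLayer("cdn", ttl_seconds=300),
    CacheLayer("browser", ttl_seconds=60),
]
print(worst_case_delay(layers))  # 360
```

Even this simple sketch shows an effect that is easy to miss: purging a cache that sits in front of a stale TTL-based layer does not help, because the purged cache can refill itself with the old content.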

But when I look at the customer requests coming in for cache flushes (CDN and dispatcher), I think that in most cases there is no clear understanding of what actually happened; most often it’s just that on author the content is as expected and was activated properly, but the change does not show up the same way on publish. The solution is often to request a cache flush (or trigger it yourself) and hope for the best. And very often this fixes the problem, and the most up-to-date content is delivered again.

But is there an understanding of why the caches were not updated properly? Honestly, I doubt it in most cases. Just as the infamous “Windows restart” can fix annoying problems that suddenly appear on your computer, flushing caches seems to be one of the first steps for fixing content problems. The issue goes away, we shrug and go on with our work.

But unlike with Windows, here you have the dispatcher configuration in your git repository. And you know the rules of caching. You have everything you need to understand the problem better and even prevent it from happening again.

Whenever authoring users come to you with the request “content is not showing up, please flush the cache”, you should treat the situation as a bug. And it is a bug, because the system is not working as expected. You should apply the workaround (do the flush), but afterwards invest time in a root-cause analysis (RCA) of why it happened. Understand and adjust the caching rules. Very often these cases are easily reproducible.
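A first step in such an analysis can be to check which TTL the stale response actually carried. The following Python sketch computes the remaining freshness from the response headers. It assumes the TTL arrives as a standard `Cache-Control` header with `max-age` or `s-maxage` plus an optional `Age` header, and that a shared cache prefers `s-maxage` as described in the HTTP caching standard; whether your CDN is configured exactly this way is something to verify. The function name `remaining_freshness` and the header values are made up for illustration.

```python
import re
from typing import Optional


def remaining_freshness(headers: dict, shared_cache: bool = True) -> Optional[int]:
    """Seconds a cache may still serve this response without refetching.

    A negative value means the response is already stale; None means no
    max-age or s-maxage was found. Simplification: directives such as
    no-store, no-cache and private are ignored here.
    """
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    cache_control = normalized.get("cache-control", "").lower()
    ttls = dict(re.findall(r"(s-maxage|max-age)\s*=\s*(\d+)", cache_control))

    if shared_cache and "s-maxage" in ttls:
        lifetime = int(ttls["s-maxage"])  # shared caches (CDN) prefer s-maxage
    elif "max-age" in ttls:
        lifetime = int(ttls["max-age"])
    else:
        return None

    # Age is the time the response has already spent in caches.
    age = int(normalized.get("age", "0").strip() or 0)
    return lifetime - age


# Hypothetical headers as seen on a response that shows outdated content.
example = {"Cache-Control": "max-age=60, s-maxage=600", "Age": "580"}
print(remaining_freshness(example))  # 20
```

If the number explains how long the old content stayed visible, the cache behaved exactly as configured, and the fix belongs in the caching rules and not in another flush.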

In his LinkedIn post Samuel writes “Sometimes the CDN cache is just messed up”, and I think that is not true. It’s not a random event that you cannot influence at all. On the contrary, it’s an event that is defined by your caching configuration. It’s an event you can control and prevent; you just need to understand how. And I think that this step of understanding and then fixing it is missing very often. Then the next request from your authoring users for a cache flush is inevitable, and another cache flush is executed.

In the end, flushing caches comes at the price of increased latency for end users until the cache is populated again. And that’s a situation we should avoid as best we can.

So as a conclusion:

  • An explicitly requested cache clear is a bug, because it means that something is not working as expected.
  • And like every bug, it should be understood and fixed, so you are no longer required to perform the workaround.

One thought on “My view on manual cache flushing”

  1. Thanks for saying this – I’ve long felt the same way. How many times has someone asked me, as an ops guy, to “flush all the caches” because some content doesn’t look quite right based on what someone is seeing in their browser? You’re right in that the most expedient thing is just to “flush the cache” and not to find out why the cache wasn’t being dealt with on its own.

    There are edge cases always where an explicit and immediate cache flush needs to happen. Maybe you purposefully have some very long-duration caching on elements you never change, and then suddenly you have to change them because of <<reason>>. But by and large, there shouldn’t need to be a ton of explicit cache flushing, and I’d think even in the scenario of a deployment, you could still engineer your way into not having to do a cache flush.

    It’s the same, I think, in how we always used to do CQ5 restarts with every deployment because…well, we always did. That was the only way to sidestep every bug we ever didn’t want to quash. Just restart it.

    All that being said, I too am glad for a cache flush button. 🙂
