Handling Campaign traffic in AEM

It must have been 2007 when I first saw URLs with a query string like “utm_id=someHexCode” in the logs of the Communiqué system I ran at that time. I still remember that we had about 4’000 of them on any given day, which was not much of a problem. But I didn’t know back then that we would still be dealing with the very same requests more than 15 years later, at an even higher rate and with more severe consequences.

What is special about these query strings? The most important thing for backend folks is that these query strings are a frontend topic. They are used to attribute requests to a certain source, which is important for the analytics folks to track the effectiveness of their campaigns.
For example, the query string “utm_id=cm2026-35-1” could be the code for “email blast 1 of campaign 35 in 2026”. If a user clicks on that link in an email, the analytics code in the page will read this query string and report it to the analytics server. This makes it possible to track the conversion rate or efficiency of this particular email blast and compare it to the results of a Facebook ad or other sources.

So this special type of traffic has 2 aspects which are important for backend folks like me:

  • It typically happens in spikes: Right after these links are distributed (via ads, emails or any other channel), users will click them.
  • These query strings have a meaning only on the frontend side; on the backend these parameters are not used at all.

But most caches don’t cache any response to a request whose URL contains a query string, so such requests bypass all CDN caches by default. That means that such requests very frequently end up on the AEM publish instance for rendering, while from a backend perspective all of the following requests will produce the same result:

  • /content.html (the response of this request could be cached)
  • /content.html?utm_id=campaign1 (non-cacheable)
  • /content.html?utm_id=campaign25 (non-cacheable)

And because these requests often arrive in spikes, such a campaign frequently triggers an overload of the AEM publish layer. That is sad, because your expensive and successful marketing campaign is then responsible for a server-side outage, and instead of a great experience you serve your customers a slow site and/or errors.

Unfortunately I see too many of those situations.

What is the AEM answer to it?

The general idea to handle this situation is to strip these campaign parameters from the request. This turns them into requests for which the response can be served from a cache, where the usual caching and expiration rules are applied. Where and how this can be done depends on your setup.

On AEM CS the best way is to handle this directly on the CDN, using Traffic Rules to normalize requests. Campaign traffic is then served straight from the CDN and does not bother the origin (that is, the dispatcher and the publish instances). If you are on AEM CS you should use this approach.

An approach which can be implemented on any AEM setup is to do this on the dispatcher. With the /ignoreUrlParams setting you specify the parameters which should be ignored when the dispatcher decides whether a request is cacheable. If no other query parameters are left, the request is considered cacheable, and it’s checked against the usual dispatcher rules.

But in every case you need to be able to identify the query parameters which your AEM application actually uses. If you know them, you also know that you can ignore everything else. Configure them in the traffic rules or the /ignoreUrlParams section.
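The idea of ignoring everything except the known parameters can be sketched in a few lines of Python. This is only an illustration of the logic, not how the CDN or the dispatcher implement it; the allowlist KNOWN_PARAMS and the function names are made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlist: the only query parameters this imaginary
# application evaluates on the backend. Everything else is ignored.
KNOWN_PARAMS = {"q", "page"}


def normalize(url: str) -> str:
    """Drop every query parameter the backend does not use."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in KNOWN_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def is_cacheable(url: str) -> bool:
    """Simplified rule from the text: cacheable if no query string is left."""
    return urlsplit(normalize(url)).query == ""


if __name__ == "__main__":
    for url in ("/content.html",
                "/content.html?utm_id=campaign1",
                "/content.html?utm_id=campaign25",
                "/content.html?utm_id=campaign25&q=aem"):
        print(url, "->", normalize(url), "cacheable:", is_cacheable(url))
```

With this normalization the three URLs from the list above collapse into the same cacheable request, while a request carrying a parameter the application really uses still reaches the backend.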

Every AEM instance should have this configured in order to handle such traffic spikes.

My view on manual cache flushing

I read the following statement by Samuel Fawaz on LinkedIn regarding the recent announcement of the self-service feature to get the API key for CDN purge for AEM as a Cloud Service:

“[…] Sometimes the CDN cache is just messed up and you want to clean out everything. Now you can.”

I fully agree that self-service for this feature was overdue. But I always wonder why an explicit cache flush (both for CDN and dispatcher) is necessary at all.

The caching rules are very simple, as the rules for the AEM as a Cloud Service CDN are all based on the TTL (time-to-live) information sent from AEM or the dispatcher configuration. The caching rules for the dispatcher are equally simple and should be well understood (I find that this blog post on the TechRevel blog covers this topic of dispatcher cache flushing quite well).

In my opinion it should be doable to build a model which allows you to estimate how long it takes for a page update to become visible to all users on the CDN. Such a model also allows you to reason about more complex situations (especially when content is pulled from multiple pages/areas to render) and to understand how and when content changes become visible to end users.
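A very rough version of such a model could look like the following sketch. It is a deliberate simplification under my own assumptions: every layer is either actively invalidated on activation or purely TTL-based, the worst-case delays of the TTL-based layers simply add up, and details like the HTTP Age header are ignored. The layer names and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheLayer:
    name: str
    ttl_seconds: int            # how long a copy may be served without refetching
    actively_invalidated: bool  # True if an activation flushes this layer


def worst_case_visibility(layers: list[CacheLayer],
                          invalidation_delay_seconds: int = 0) -> int:
    """Pessimistic upper bound (seconds) until a change is visible to everyone.

    Layers are ordered from origin to end user. Assumption: a TTL-based layer
    may have fetched the old copy just before it was replaced behind it, so
    the TTLs of all TTL-based layers are summed up.
    """
    total = invalidation_delay_seconds
    for layer in layers:
        if not layer.actively_invalidated:
            total += layer.ttl_seconds
    return total


if __name__ == "__main__":
    layers = [
        CacheLayer("dispatcher", ttl_seconds=0, actively_invalidated=True),
        CacheLayer("cdn", ttl_seconds=300, actively_invalidated=False),
        CacheLayer("browser", ttl_seconds=60, actively_invalidated=False),
    ]
    print(worst_case_visibility(layers, invalidation_delay_seconds=5))
```

If the number this produces does not match what your authors observe, that mismatch is exactly the starting point for an analysis.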

But when I look at the customer requests coming in for cache flushes (CDN and dispatcher), I think that in most cases there is no clear understanding of what actually happened; most often it’s just that on the author instance the content is as expected and activated properly, but this change does not show up the same way on publish. The solution is often to request a cache flush (or trigger it yourself) and hope for the best. And very often this fixes the problem, and then the most up-to-date content is delivered.

But is there an understanding of why the caches were not updated properly? Honestly, I doubt that very often. In the same way that the infamous “Windows restart” can fix annoying, suddenly appearing problems with your computer, flushing caches seems to be one of the first steps for fixing content problems. The issue goes away, we shrug and go on with our work.

But unlike in the case of Windows, you have the dispatcher configuration in your git repository. And you know the rules of caching. You have everything you need to understand the problem better and even prevent it from happening again.

Whenever the authoring users come to you with the request “content is not showing up, please flush the cache”, you should treat this situation as a bug. Because it is a bug: the system is not working as expected. You should apply the workaround (do the flush), but afterwards invest time into a root-cause analysis (RCA) of why it happened. Understand and adjust the caching rules, because very often these cases are well reproducible.

In his LinkedIn post Samuel writes “Sometimes the CDN cache is just messed up”, and I think that is not true. It’s not a random event you cannot influence at all. On the contrary, it’s an event which is defined by your caching configuration. It’s an event which you can control and prevent; you just need to understand how. And I think that this step of understanding and then fixing it is missing very often. Then the next request from your authoring users for a cache flush is inevitable, and another cache flush is executed.

In the end flushing caches comes at the price of increased latency for end users until the cache is populated again. And that’s a situation we should avoid as best we can.

So as a conclusion:

  • An explicitly requested cache clear is a bug because it means that something is not working as expected.
  • And like every bug it should be understood and fixed, so you are no longer required to perform the workaround.

CDN and dispatcher – 2 complementary caching layers

I sometimes hear the question of how to implement cache invalidation for the CDN. Another question is why AEM CS still operates with a dispatcher layer when it now has a more powerful CDN in front of it.

The questions are very different, but the answer is the same in both cases: the CDN is no replacement for the dispatcher, and the dispatcher does not replace the CDN. They serve different purposes, and the combination of these two can be a really good package. Let me explain this.

The dispatcher is a very traditional cache. It’s fronting the AEM systems, and the cache status is actively maintained by cache invalidation so it always delivers current data. But from an end-user perspective this cache is often far away in terms of network latency. If my AEM systems are hosted in Europe and end users from Australia are reaching them, the latency can get huge.

The CDN is the opposite: it serves the content from many locations across the world, being as close to the end user as possible. But CDN cache invalidation is cumbersome, and for that reason TTL-based expiration is used most often. That means you have to accept that the CDN may still deliver old content although new content is already available.

Not everyone is happy with that; and if that’s a real concern, short TTLs (in the range of a few minutes) are the norm. That means that many files on the CDN will get stale every few minutes, which results in cache misses; and a cache miss on the CDN goes back to origin. But of course the reality is that not many pages change every 10 minutes; actually very few. But customers want to have that low TTL just in case a page was changed and that change needs to become visible to all end users as soon as possible.

So you have a lot of cache misses on the CDN, which trigger a re-fetch of the file from origin, and because many of the files have not changed, you refetch exactly the same binary which got stale seconds ago. That is a waste of resources, because your origin system delivers the same content over and over again to the CDN as a consequence of these misses. So you could keep your AEM instances busy all the time, re-rendering the same requests over and over, always creating the same response.

Enter the dispatcher cache, fronting the actual AEM instance. If the file has not changed, the dispatcher will deliver the same file (or just an HTTP 304 Not Modified, which even avoids sending the content again). And it’s fast, much faster than letting AEM render the same content again. And if the file has actually changed, it’s rendered once and then reused for all the future CDN cache misses.
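To get a feeling for the effect, here is a toy simulation of a single page. It assumes steady traffic (so the CDN refetches the page on every TTL expiry), a single CDN cache, and a dispatcher cache that is invalidated exactly when the page is activated. The function name and the numbers are invented for this example.

```python
def count_renders(duration: int, cdn_ttl: int, change_times: set[int],
                  with_dispatcher: bool) -> int:
    """Count how often AEM publish has to render one page.

    Time is counted in seconds. Assumption: the CDN copy expires and is
    refetched from origin every `cdn_ttl` seconds.
    """
    renders = 0
    dispatcher_has_copy = False
    for second in range(duration):
        if second in change_times:
            dispatcher_has_copy = False   # activation invalidates the dispatcher cache
        if second % cdn_ttl == 0:         # CDN copy is stale: cache miss, go to origin
            if with_dispatcher and dispatcher_has_copy:
                continue                  # served from the dispatcher, no rendering
            renders += 1                  # AEM publish renders the page
            dispatcher_has_copy = True
    return renders


if __name__ == "__main__":
    one_hour, five_minutes = 3600, 300
    changes = {1000}                      # the page is activated once within the hour
    print(count_renders(one_hour, five_minutes, changes, with_dispatcher=False))
    print(count_renders(one_hour, five_minutes, changes, with_dispatcher=True))
```

In this scenario the page is rendered on every CDN miss without the dispatcher, but only after the initial request and after the single activation when the dispatcher cache sits in between.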

The combination of these 2 caching approaches helps you to deliver content from the edge while at the same time having a reasonable latency for content updates (that means the time between replicating a change to the publish instances and all users across the world being able to see it) without the need to have a huge number of AEM instances in the background.

So as a conclusion, using the CDN and the dispatcher cache is a good combination, if set up properly.