Limits of dispatcher caching with AEM as a Cloud Service

In the last blog post I proposed 5 rules for caching with AEM and how you should design your caching strategy. Today I want to show another aspect of rule 1: prefer caching at the CDN over caching at the dispatcher.

I already explained that the CDN is always located closer to the consumer, so the latency is lower and the experience will be better. But when we limit the scope to AEM as a Cloud Service, the situation gets a bit complicated, because the dispatcher is not able to cache files for more than 24 hours.

This is caused by two architectural decisions made for AEM as a Cloud Service:

  • Publish instances (and the dispatchers attached to them) are not long-living; they are terminated and re-created regularly, at least every 24 hours.
  • The dispatcher cache is local to its publish/dispatcher pair and is not persisted or shared, so a freshly created instance always starts with an empty cache.

These 2 decisions lead to the fact that no dispatcher cache can hold files for more than 24 hours, because the instance is terminated after that time. And there are other situations where the publish instances are re-created, for example during deployments and up/down-scaling situations; then the cache does not even hold files for 24 hours, but maybe just for 3 hours.

This naturally limits the cache-hit ratio in cases where you have content which is requested frequently but does not change for days, weeks or even months. In an AEM as a Cloud Service setup these files are rendered at least once per day (or more often, see above) per publish/dispatcher pair, while in other setups (for example AMS or on-prem setups, where long-living dispatcher caches are pretty much the default) they can be delivered from the dispatcher cache without the need to re-render them every day.

The CDN does not have this limitation. It can hold files for days and weeks and deliver them, if the TTL settings allow it. But as you can control the CDN only via TTLs, you have to make a tradeoff between the cache-hit ratio on the CDN and the freshness of the delivered content when something changes.

That means:

  • If you have files which do not change, just set a large TTL on them and let the CDN handle them. A good example are clientlibs (JS and CSS files), because they have a unique name (an additional selector which is created as a hash over the content of the file).
  • If there’s a chance that you will change such content (mostly pages), you should set a reasonable TTL (and of course “stale-while-revalidate”) and accept that your publish instances need to re-render these pages when the TTL has expired. A sketch of such header rules follows below this list.
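To make this a bit more concrete, here is a minimal sketch of how such rules could look in the dispatcher vhost configuration using mod_headers. The paths and the concrete values are assumptions and need to be adapted to your project:

# clientlibs: the hashed selector makes the URL unique, so the content behind it never changes
<LocationMatch "^/etc\.clientlibs/.*\.(js|css)$">
  Header set Cache-Control "max-age=31536000, immutable"
</LocationMatch>

# pages: moderate TTL plus stale-while-revalidate to hide the re-rendering latency
<LocationMatch "^/content/.*\.html$">
  Header set Cache-Control "max-age=300, stale-while-revalidate=86400"
</LocationMatch>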

That’s a bit of a drawback of the AEM as a Cloud Service setup, but on the other hand your dispatcher caches are regularly cleared.

Writing integration tests for AEM (part 4)

This is part of my ongoing series about writing integration tests with AEM.

In the last post I mentioned that the URL provided to our integration tests allows us to test our dispatcher rules as well, a kind of “unit test” for the dispatcher setup. That’s what we will do now.

This is the German way of saying “Stop here if you don’t have the right user-agent^Wvehicle”
Photo by Julian Hochgesang on Unsplash

As a first step we need to create a new RequestValidationClient, because we need to customize the underlying HTTP client so it does not automatically follow HTTP redirects; otherwise it would be impossible for us to test redirects. And while we are at it, we want to customize the user-agent header as well, so it’s easier to spot the requests we make during the integration tests. The way to customize the underlying HTTP client is documented, but a bit clumsy. But besides that this RequestValidationClient is not different from the SlingClient it’s derived from. Maybe we will change that later.
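The actual customization happens on the HTTP client underlying the SlingClient; as a rough illustration of the two changes (no automatic redirect handling, a custom user-agent), here is what they look like with a plain Apache HttpClient 4.x. This is a sketch of the idea, not the real RequestValidationClient; the user-agent string and the URL are placeholders:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class NoRedirectClientSketch {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.custom()
                .disableRedirectHandling()              // we want to assert on 301/302 responses ourselves
                .setUserAgent("my-integration-tests")   // makes our requests easy to spot in the access logs
                .build()) {
            HttpGet get = new HttpGet("https://publish.example.com/");   // placeholder URL
            try (CloseableHttpResponse response = client.execute(get)) {
                System.out.println(response.getStatusLine().getStatusCode());
                System.out.println(response.getFirstHeader("Location"));
            }
        }
    }
}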

The actual integration tests are in PublishRedirectsIT. Here I use this RequestValidationClient to perform unauthenticated requests (as end users typically do) against the publish instance. To illustrate the usage of the client, there are 3 tests (a simplified sketch follows after the list):

  • In the testInitialRedirectAndHomepage method it is validated that a request to “/” results in a permanent redirect to /us/en.html. Additionally it is made sure that /us/en.html is actually present and returns a 200.
  • A second test is hitting /system/console, which must never be exposed to the internet.
  • A third test ensures that the default GET servlet is properly secured, so that the infamous “infinity” selector for the JSON extension returns a 404.
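A heavily simplified sketch of these three checks could look like the following. This is not the actual PublishRedirectsIT code; statusOf() and locationOf() are hypothetical helpers which issue a GET with the non-redirecting client sketched above and return the status code and the Location header, and the exact status codes expected for the blocked URLs are assumptions:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class PublishRulesSketchIT {

    @Test
    public void initialRedirectAndHomepage() throws Exception {
        assertEquals(301, statusOf("/"));                      // "/" answers with a permanent redirect ...
        assertEquals("/us/en.html", locationOf("/"));          // ... pointing to the homepage
        assertEquals(200, statusOf("/us/en.html"));            // and the homepage itself is reachable
    }

    @Test
    public void systemConsoleIsBlocked() throws Exception {
        assertEquals(404, statusOf("/system/console"));        // the OSGi console must never be exposed
    }

    @Test
    public void defaultGetServletIsSecured() throws Exception {
        assertEquals(404, statusOf("/content.infinity.json")); // the infinity selector must not be rendered
    }

    // statusOf(String path) and locationOf(String path) omitted: hypothetical helpers, see above
}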

With this approach it is possible to validate that the complete security checklist of the dispatcher is actually implemented and that all “invalid” URLs are properly blocked.

Some remarks on the PublishRedirectsIT implementation itself:

  • Also here the tests are a bit clumsier than they could be. First, because the recommended ways to perform an HTTP request always have an “expectedReturnCode” parameter, which is unfortunate because we want to perform this check ourselves. For that reason I built a small workaround to accept all status codes. The testing clients should offer that natively, though.
  • And secondly, I encountered problems with the authentication on the publish instance; that’s the reason why the creation of the anonymousPublish client is the way it is.

But anyway, that’s a neat approach to validate that your dispatcher setup is properly done. And of course you could also use the JsoupClient to test a page on publish as well.

Some remarks if you want to execute these tests in your system: I adjusted the configuration of the “dispatcher” module of the repository as well, so you can easily use it together with the dispatcher docker image (check out this fantastic documentation).

That’s it for today, happy testing!

Creating the content architecture with AEM

In the last post I tried to describe the difference between the information architecture and the content architecture; and from an architectural point of view the content architecture is quite important, because your application design will emerge based on it. But how can you get to a stable and well-thought-out content structure?

Well, there’s no bullet-proof approach for it. When you design the content architecture for an AEM-based application, it’s best to have some experience with the hierarchical approach offered by the repository. I will try to outline a process which might help to get you there.
It’s not a definitive guideline and I cannot guarantee that it will work for you, as it is just based on my experience with the projects I did. But I hope that it will give you some input and can act as a kind of checklist. My colleague Alex Klimetschek gave a presentation about this topic at the adaptTo() conference 2012.

The tree

But before we start, I want to remind you of the fact that everything you do has to fit into the JCR tree. This tree is typically a big help, because we often think in trees (think of decision trees, divide-and-conquer algorithms, etc.), and URLs are organized in a tree-ish way as well. Many people in IT are familiar with the hierarchical way filesystems are organized, so it’s both a comfortable and an easy-to-explain approach.

Of course there are cases where it makes things hard to model; if you hit that problem, you should try to choose a different approach. Building any n:m relation in the AEM content tree is counter-intuitive, hard to implement and typically not really performant.

Start with the navigation

Coming from the information architecture you typically have some idea how the navigation of the site should look. In a typical AEM-based site, the navigation is based on the content tree; that means that traversing the first 2-3 levels of your site tree yields the navigation (tree). Mapped the other way around, you can get from the navigation to the site tree as well.
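As a purely hypothetical illustration (site name and structure invented), a site tree like this would directly produce a two-level navigation:

/content/mysite/en                     <- homepage, navigation root
/content/mysite/en/products            <- first navigation level
/content/mysite/en/products/tools      <- second navigation level
/content/mysite/en/products/services
/content/mysite/en/about               <- first navigation level
/content/mysite/en/about/team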

This definition definitely has an impact on your site, as now the navigation is tied to your content structure; changing one without the other is hard. So make your decision carefully.

Consider content-reuse

As the next step, consider the parts of the website which have to be identical, e.g. header and footer. You should organize your content in a way that these central parts are maintained once for the whole site, and that any change to them can be inherited down the content tree. When you choose this approach, it’s also very easy to implement a feature which allows you to change that content at any level and inherit the changed content down the tree, effectively breaking the inheritance at this point.

If you are at this level, also consider dispatcher invalidation. Whenever you change such “centralized” content, it should be easily possible to purge the dispatcher cache; in the best case the activation of the changed content will trigger the invalidation of all affected pages (and not more!), assuming that you have your /statfileslevel parameter set correctly.
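As a reminder, the /statfileslevel parameter lives in the /cache section of the dispatcher farm configuration; a minimal sketch (docroot and level are just examples):

/cache
  {
  /docroot "/opt/www/docroot"
  # maintain .stat files down to this level of the docroot, so that an activation
  # only invalidates the affected subtree instead of the whole cache
  /statfileslevel "2"
  # (other /cache settings such as /rules and /invalidate omitted)
  }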

Consider access control

As a third step, let’s consider the already existing structure under the aspect of access control, which you will need on the authoring environment.
On smaller sites this topic isn’t that important, because you have only a single content team which maintains all the pages. But especially in larger organizations you have multiple teams, and each team is responsible for dedicated parts of the site.

When you design your content structure, overlay the content structure with these authoring teams, and make sure that you avoid any situation where a principal has write access to a page but not to any of its child pages. While this is not always possible, try to follow these guidelines regarding access control:

  • When walking from the root node of the tree down to a node on a lower level, always add privileges, but do not remove them (see the small example after this list).
  • Every author for that site should have read access to the whole site.
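A purely hypothetical example for two authoring teams (group names invented) which follows this rule:

/content/mysite            site-authors:     read                 (everybody can read the whole site)
/content/mysite/products   product-authors:  read + write         (added on top, nothing taken away)
/content/mysite/news       news-authors:     read + write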

If you have a very complicated ACL setup (and you’ve already failed to make it simpler), consider changing your content structure at this point, and give the ACL setup a higher priority than, for example, the navigation.

My advice at this point: try to keep your ACL setup very simple; the more complex it gets, the more time you will spend debugging your group and permission setup to find out what’s going on in a certain situation, and the harder it will be to explain to your authors.

Multi-Site with MSM

Having gone through these 3 steps, you already have some idea how your final content structure needs to look. There is another layer of complexity if you need to maintain multiple sites using the Multi Site Manager (MSM). The MSM allows you to inherit content and content structure to another site, which is typically located in a parallel sub-tree of the overall content tree. Choosing the MSM will keep your content structures consistent, which also means that you need to plan and set up your content master (in MSM terms it is called the blueprint) in a way that the resulting structure is well-suited for all copies of it (in MSM: live copies).
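Schematically (site names invented), the blueprint and its live copies sit in parallel sub-trees:

/content/master/en      <- blueprint, the content master
/content/site-a/en      <- live copy of the blueprint
/content/site-b/en      <- live copy of the blueprint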

And on top of the MSM you can add more specifics, features and requirements, which also influence the content structure of your site. But let’s finish here for the moment.

When you are done with all these exercises, you already have a solid basis and have considered a lot of relevant aspects. Nevertheless you should still ask others for a second opinion. Scrutiny really pays off here, because you are likely to live with this structure for a while.

Custom dispatcher invalidation rules

Most web sites use different approaches to caching to improve their performance, some within the application and some on the delivery side of the site, such as a content delivery network (CDN) or a simple reverse proxy like Squid or Varnish. Of course all of these can also be used together with CQ5; the default caching approach on the delivery side of CQ5 is the dispatcher.

The dispatcher is a pretty straightforward cache which has some built-in rules for invalidation. The most important rule is: whenever a page is invalidated, all pages which are located deeper in the content hierarchy below this page are invalidated as well.

Some background: this is probably based on the assumption that a change to any page can influence the appearance of other pages in that part of the hierarchy; think of a change to a page title, which must be reflected in the navigation items of all other pages in that area. Combined with the /statfileslevel parameter you should always be able to deliver a correctly rendered page and never outdated content, at the cost of the cache-hit ratio in the dispatcher.

I already described the inner workings of the dispatcher in an older article. Knowing these internals also gives you some more options if you want to tune this. In the following I describe an approach which allows you to implement custom invalidation rules. The caching and the delivery of the cached files don’t change, only the cache invalidation does.

The basic idea is to replace the call to /dispatcher/invalidate.cache (which is the standard handler for cache invalidation) with a custom script which runs the invalidation on the cache. This custom script is configured as the target URL in the invalidation agent(s) and updates the “last modification date” of the stat files according to your requirements.

For example you can use this Perl script, which should behave like the standard dispatcher invalidation logic:

#!/usr/bin/perl -w
use strict;
use CGI;
use File::Find;
use File::Touch;

my $cacheroot = "/opt/www/docroot";
my $q = CGI->new;
my $path = $q->param("Handle");
my $invalidation_path = $cacheroot . $path;

# walk the cache below the invalidated handle and touch every .stat file found there
File::Find::find({wanted => \&wanted}, $invalidation_path);

# answer the invalidation request so the agent does not report an error
print $q->header(-status => '200 OK');
exit;

sub wanted {
  # File::Find sets $_ to the name of the current file (and chdirs into its directory)
  if ($_ eq '.stat') {
    touch($_);
  }
}

Configure this script as a CGI at “/dispatcher/custom.invalidation”, then point your invalidation agents to it, and voilà, you’re done: your custom invalidation script is running. If you don’t like the standard behaviour (obviously you don’t …), adjust the logic in the wanted subroutine.
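On the Apache side, a minimal sketch of the wiring could look like this (the script location is an assumption); the invalidation agent then points to /dispatcher/custom.invalidation instead of /dispatcher/invalidate.cache:

# map the custom invalidation URL to the script and allow it to run as CGI
ScriptAlias /dispatcher/custom.invalidation /opt/scripts/custom-invalidation.pl
<Directory "/opt/scripts">
  Options +ExecCGI
  Order allow,deny
  Allow from all
</Directory>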

Take care of your selectors!

Recently I have shown two scenarios where selectors can be used to cache several different views of a single page. This often allows you to avoid HTTP parameters, reducing the load on your machines and speeding up your website.

Let’s assume that you have the already mentioned handle /etc/medialibrary/trafficjam.html and your template supports displaying the image in 3 different sizes, “preview”, “big” and “original”. So what happens if somebody chooses to request the URL “/etc/medialibrary/trafficjam.tiny.html”?

I checked some CQ-based websites and tested this behaviour by just adding a dummy selector. In most cases you get a proper page rendered, looking the same way as without the selector. So most templates (and also template developers) ignore selectors the template isn’t expected to handle. That’s good, isn’t it?

Well, in combination with the dispatcher cache it isn’t good, because the dispatcher caches everything which is returned from CQ with an HTTP status code of 200. So just adding a “foo” selector will place another copy of the page in the dispatcher cache. The same happens with a “foo1” selector and so on. In the end the disk is full, the dispatcher cannot write any more files to disk, and it will forward every request to your CQ.

So, how can you bypass this problem? As said, the dispatcher only caches when it receives an HTTP status code 200. So you need to add some code to your templates which always checks the selectors (see the sketch below). If this specific template doesn’t support any selector and is called without one, fine. If it is called with a selector, don’t return a status code 200, but a 301 (permanent redirect) to the same page without any selectors, or just a plain 404 (“file not found”); because calling this page with selectors isn’t a valid action and should never happen, such a status code is ok. The same applies when the template supports a limited set of selectors (“preview”, “big” and “original” in the example above); just treat them as a whitelist, and if the given selector doesn’t match, return a 301 or 404 code.
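A minimal sketch of such a check; SelectorGuard is a hypothetical helper class, and the allowed selectors and the chosen status code are project-specific decisions:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import javax.servlet.http.HttpServletResponse;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;

public class SelectorGuard {

    private static final List<String> ALLOWED = Arrays.asList("preview", "big", "original");

    /** Returns true if the request uses only known selectors, otherwise sends a 404. */
    public static boolean checkSelectors(SlingHttpServletRequest request,
                                         SlingHttpServletResponse response) throws IOException {
        for (String selector : request.getRequestPathInfo().getSelectors()) {
            if (!ALLOWED.contains(selector)) {
                // unknown selector: answer with 404 so the dispatcher does not cache yet another copy
                response.sendError(HttpServletResponse.SC_NOT_FOUND);
                return false;
            }
        }
        return true;
    }
}

In a component you would call SelectorGuard.checkSelectors(slingRequest, slingResponse) at the beginning of the rendering and stop if it returns false.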

This way you don’t pollute your cache and still have the flexibility to use selectors. I think that this benefit outweighs the cost of adjusting your templates.

Permission sensitive caching

In the latest versions of the dispatcher (starting with the 4.0.1 release) Day added a very interesting feature which allows one to also cache content on dispatcher level which is not public. Honwai Wong of the Day support team explained it very well at the TechSummit 2008. I was a bit surprised, but I even found it on Slideshare (the first half of the presentation). Honwai explains the benefits quite well. From my experience you can reduce the load on your CQ publishers (trading a request which requires the rendering of a whole page for a request which just checks the ACLs of a page).

If you want to use this feature, you have to make sure that for every group or user who needs an individual page, the dispatcher delivers the right one. Imagine you want to present the latest company news to logged-in users, while users who are not logged in shouldn’t get them, and only the managers get the link to the latest financial data on the start page. So you need a start page for 3 different groups (not-logged-in users, logged-in users, managers), and the system should deliver the appropriate one. Having a single home.html isn’t enough, you need to distinguish.

The easiest way (and the Day way ;-)) is to use a selector denoting the group the user belongs to. So home.group-logged_in.html or home.group-managers.html would be good. If no selector is given, we assume the user to be an anonymous user. You have to configure the link checker to rewrite all links to contain the correct selector. So if a user belongs to the logged_in group and requests the home.group-logged_in.html page, the dispatcher will ask CQ: “the user has the following HTTP header lines and is requesting home.group-logged_in.html, is that ok?”. CQ then checks if the given HTTP header lines belong to a user of the group logged_in; because they do, it responds with “200 OK, just go on”. And then the dispatcher will deliver the cached file, and there’s no need for CQ to render the same page again and again. If the user doesn’t belong to that group, CQ will detect that and send a “403 Permission denied”, and the dispatcher forwards this answer to the user. If a user is a member of more than one group, having multiple “group-” selectors is perfectly valid.
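In later dispatcher releases this mechanism is configured via an /auth_checker section in the farm configuration, pointing to a servlet on CQ which answers the permission check with 200 or 403. Treat the following as a rough sketch based on the later documented versions; the section names, globs and the servlet path are assumptions:

/auth_checker
  {
  # servlet on the publish instance which checks the ACLs for the requested page
  /url "/bin/permissioncheck"
  # which cached files require a permission check before they are delivered
  /filter
    {
    /0000 { /glob "*" /type "deny" }
    /0001 { /glob "/content/secure/*.html" /type "allow" }
    }
  # request headers which are forwarded to the check servlet (e.g. the login cookie)
  /headers
    {
    /0000 { /glob "*" /type "deny" }
    /0001 { /glob "Cookie:*" /type "allow" }
    }
  }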

Please note: I speak of groups, not of (individual) users. I don’t think that this feature is useful when each user requires a personalized page. The cache-hit ratio is pretty low (especially if you include often-changing content on it, e.g. daily news or the content of an RSS feed) and the disk consumption would be huge. If a single page is 20 kB and you have a version cached for 1000 users, you have a disk usage of 20 MB for a single page! And don’t forget the performance impact of a directory filled with thousands of files. If you want to personalize pages for individual users, caching is inappropriate. Of course the usual nasty hacks are applicable, like requesting the user-specific data via an AJAX call and then modifying the page in the browser using JavaScript.

Another note: currently no documentation is available on permission-sensitive caching, only the presentation by Honwai Wong linked above.

CQ Dispatcher 4.0.3 available.

A few days ago Day released a new version of the Day CQ Dispatcher plugin. One of the most important topics in this release is the number of supported platforms: the dispatcher now ships as an Apache module for Solaris Sparc and x86, both for Apache 2.0 and Apache 2.2. Finally the most relevant platforms are supported for Apache 2.0 and 2.2 (the exception to this rule is AIX). There are also a few bugfixes for the Windows and Mac OS X platforms and, once again, a fix for permission-sensitive caching.

Permission sensitive caching will be next topic here, so stay tuned.

Creating cachable content using selectors

The major difference between static objects and dynamically created objects is that the static ones can be stored in caches; the content they contain does not depend on user or login data, date or other parameters. They look the same on every request, so caching them is a good idea to move load off the origin system and to accelerate the request-response cycle.

A dynamically created object is influenced by certain parameters (usually username/login, permissions, date/time, but there are countless others) and therefore its content may differ from request to request. These parameters are usually specified as query parameters, and such responses must not be cached (see the HTTP 1.1 specification in RFC 2616).

But sometimes it would be great if we could combine these 2 approaches. For example, you want to offer images in 3 resolutions: small (as a preview image, e.g. in a folder view), big (full-screen view) and original (the full resolution delivered by the picture-taking device). If you decide to deliver them as static objects, they are cacheable, but you then need 3 names, one for each resolution. This blurs the fact that these 3 images are the same and differ only in their resolution; it creates 3 images instead of one image in 3 instances. A more practical drawback is that you always have to precompute these 3 pictures and place them in a reachable location, and lazy generation is hard as well.

If you choose the dynamic approach, the image is available as one object whose instances can be created dynamically. The drawback here is that it cannot be cached.

Day Communique has a feature (the guys at Day also ported it to Apache Sling) called selectors. They behave like the query parameters one has used since the stone age of the HTTP/HTML era, but they are not query parameters; they are encoded in the static part of the URL. So the query part of the URL (as of HTTP 1.1) is no longer needed.

So you can use the URLs /etc/medialib/trafficjam.preview.jpg, /etc/medialib/trafficjam.big.jpg and /etc/medialib/trafficjam.original.jpg to address the image in the 3 required resolutions. If your dispatcher doesn’t find them in its cache, it will forward the request to your CQ, which can then scale the requested image on demand. The dispatcher can then store the image and deliver it from its cache afterwards. That’s a very simple and efficient way to make dynamic objects static and to offload requests from your application servers.

Caching the right way

I sometimes notice that there is some kind of confusion about how content is transferred from a CQ system to the end user, mostly regarding caches, cache invalidation and content expiration.

We must make a difference between 2 separate mechanisms:

  1. Caching as in “Communique dispatcher cache”. As already described, the dispatcher cache only gets invalidated when a replication agent triggers the invalidation. There isn’t a mechanism which invalidates content after a certain amount of time.
  2. Caching as in “make use of the browser cache”. An RFC to the HTTP standard describes several mechanisms to specify the timeframe in which objects are valid. Here is a more informal introduction.

So these 2 mechanisms don’t collide; if you want to distribute your content effectively you should use both: the dispatcher cache to lower the load on your CQ systems, and the right HTTP headers to move traffic off your systems (and your internet connection) to downstream proxies and browser caches.

Some remarks on the right HTTP headers:

  • If you don’t have any HTTP headers for caching, most proxies and browsers guess how long they consider an object “live” or “valid”. Do not rely on these guesses, control it yourself! Add the headers.
  • CQ doesn’t add any caching headers by itself.
  • A very easy way to add HTTP caching headers is to configure your webserver to add them (for Apache, mod_expires is quite easy to use). Then every time your webserver delivers an object through the dispatcher (either by fetching it from CQ or by retrieving it from the cache), it will add these headers; a minimal mod_expires sketch follows below this list.
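A minimal mod_expires sketch (the concrete lifetimes are only examples and need to match your invalidation strategy):

# let Apache add Expires/Cache-Control headers to everything it serves through the dispatcher
ExpiresActive On
ExpiresByType image/jpeg "access plus 1 week"
ExpiresByType text/css "access plus 1 day"
ExpiresByType application/x-javascript "access plus 1 day"
ExpiresDefault "access plus 1 hour"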

Hints on performance (part 2)

Out of curiosity I often take a look at the HTTP headers of websites I visit (I use the great Firefox plugin Live HTTP Headers for it). On some major websites I discovered that they don’t make use of HTTP pipelining, which is nowadays a major performance drawback, since today’s websites include many more items (images, CSS, JavaScript files) than a website of 1998.

Quoting http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html:

For all our tests, a pipelined HTTP/1.1 implementation outperformed HTTP/1.0, even when the HTTP/1.0 implementation used multiple connections in parallel, under all network environments tested. The savings were at least a factor of two, and sometimes as much as a factor of ten, in terms of packets transmitted.

This point isn’t directly related to Day Communique, but can be applied to all webpages. Take your browser and check if your site makes use of HTTP 1.1 pipelining. How?

Well, that’s pretty easy: take your browser (I suggest Firefox and the above-mentioned plugin Live HTTP Headers), open the plugin and then go to your website. Then check whether the responses contain the line “Connection: close”; whenever you see this line, it means that your browser must open a new TCP connection to fetch the next file from the server. In the best case you should not find this header at all. If you find such a line, you should really sit down and try to get rid of it.
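If you prefer the command line over a browser plugin, the same check can be done with curl (the URL is a placeholder):

# print only the response headers and look for the Connection header
curl -sI http://www.example.com/ | grep -i '^Connection'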

2 remarks regarding the dispatcher:

  1. The Apache webserver can deliver files from the cache or via the dispatcher without breaking HTTP pipelining. So here it doesn’t matter if a file is taken from the cache or fetched from CQ; if you configured your Apache correctly, you’ll never get a “Connection: close”.
  2. The dispatcher itself also fetches files using HTTP pipelining by default. You can force it not to do so, but I don’t recommend it. In versions before dispatcher 4.0 this behaviour was broken, but in the most recent versions it works perfectly. And of course the servlet engine bundled with CQ 3.5.5 and newer supports HTTP pipelining out of the box.

For further reading I recommend Aaron Hopkins’ “Optimizing Page Load Times” and for general performance hints the Best Practices for Speeding Up Your Website by Yahoo.