Permission sensitive caching

In the last versions of the dispatcher (starting with the 4.0.1 release) Day added a very interesting feature to the dispatcher, which allows one to cache also content on dispatcher level which are not public.
Honwai Wong of the Day support team explained it very well on the TechSummit 2008. I was a bit suprised, but I even found it on slideshare (the first half of the presentation)
Honwai explains the benefits quite well. From my experience you can reduce the load on your CQ publishers (trading a request which requires the rendering of a whole page to to a request, which just checks the ACLs of a page).

If you want to use this feature, you have to make sure that for every group or user, who has to have a individual page, the dispatcher delivers the right one. Imagine you want the present the the logged-in users the latest company news, but not logged-in users shouldn’t get them. And only the managers get the link to the the latest financial data on the startpage. So you need a startpage for 3 different groups (not-logged-in users, logged-in users, managers), and the system should deliver it appropriatly. So having a single home.html isn’t enough, you need to distinguish.

The easiest way (and the Day-way ;-)) is to use a selector denoting the group the user belongs to. So home.group-logged_in.html or home.managers.html would be good. If no selector is given, we assume the user to be an anonymous user. You have to configure the linkchecker to rewrite all links to contain the correct selector. So if a user belongs to the logged_in group and requests the home.logged_in.html page, the dispatcher will ask the CQ ” the user has the following http header lines and is requesting the home.logged_in.html, is it ok?”. CQ then checks if the given http header lines do belong to a user of the group logged_in; because he is, it responses with “200 OK, just go on”. And then the dispatcher will deliver the cached file and there’s no need for the CQ to render the same page again and again. If the users doesn’t belong to that group, CQ will detect that and send a “403 Permission denied”, and the dispatcher forwards this answer then to the user. If a user is member of more than one group, having multiple “group-“selectors is perfectly valid.

Please note: I speak of groups, not of (individual) users. I don’t think that this feature is useful when each user requires a personalized page. The cache-hit ratio is pretty low (especially if you include often-changing content on it, e.g daily news or the content of an RSS feed) and the disk consumption would be huge. If a single page is 20k and you have a version cached for 1000 users, you have a disk usage of 20 MB for a single page! And don’t forget the performance impact of a directory filled up with thousands of files. If you want to personalize pages for users, caching is inappropriate. Of course the usual nasty hacks are applicable, like requesting the user-specific data via an AJAX-call and then modifying the page in the browser using Javascript.

Another note: Currently no documentation is available on the permission sensitive caching. Only the above linked presentation of Honwai Wong.

Creating cachable content using selectors

The major difference between between static object and dynamically created object is that the static ones can be stored in caches; their content they contain does not depend on user or login data, date or other parameters. They look the same on every request. So caching them is a good idea to move the load off the origin system and to accelerate the request-response cycle.

A dynamically created object is influenced by certain parameters (usually username/login, permissions, date/time, but there are countless other) and therefor their content may be different from request to request. These parameters are usually specified as query parameters and must not be cached (see the HTTP 1.1 specification in RFC 2616).

But sometimes it would be great, if we could combine these 2 approaches. For example you want to offer images in 3 resolutions: small (as a preview image e.g in folder view), big (full screen view) and original (the full resolution delivered by the picture-taking device). If you decide to deliver it as static object, it’s cachable. But you need then 3 names (one for each resolution), one for each resolution. Choosing this will blur the fact that these 3 images are the same and differ only in the fact of the image resolution. It creates 3 images instead having only one in 3 instances. Amore practical drawback is that you always have to precompute these 3 pictures and place them on a reachable location. Lazy generation is hard also.

If you choose the dynamic approach, the image would be available as one object for which the instance can be created dynamically. The drawback is here that it cannot be cached.

Day Communique has the feature (the guys of Day ported it also to Apache Sling) to use so-called selectors. They behave like the query parameters one used since the stoneage of the HTTP/HTML era. But they are not query parameters, but merely encoded in the static part of the URL. So the query part of the ULR (as of HTTP 1.1) is no longer needed.

So you can use the URLs /etc/medialib/trafficjam.preview.jpg, /etc/medialib/trafficjam.big.jpg and /etc/medialib/trafficjam.original.jpg to adress the image in the 3 required resolutions. If your dispatcher doesn’t find them in its cache, it will forward the request to your CQ, which can then scale the requested image on demand. Then the dispatcher can store the image and deliver it then from its cache. That’s a very simple and efficient way to make dynamic objects static and offload requests from your application servers.

Caching the right way

I sometimes notice that there is some kind of confusion about how content is transferred from a CQ system to the enduser, mostly regarding caches, cache invalidation and content expiration.

We must make a difference between 2 separate mechanisms:

  1. Caching as in “Communique dispatcher cache”. As already described the dispatcher cache gets only invalidated when a replication agent triggers the invalidation. There isn’t a mechanism which invalidates content after a certain amount of time.
  2. Caching as in “make use of the browser cache”. A RFC to the HTTP standard describes several mechanism to specify the timeframe in which objects are valid. Here is a more informal introduction.

So this 2 mechanism doesn’t collide; if you want to distribute your content effectivly you should use both: The dispatcher cache to lower the load on your CQ systems, and the right HTTP headers to move traffic off your systems (and your internet connection) to downstreamd proxies and browser caches.

Some remarks to the right HTTP headers:

  • If you don’t have any HTTP headers for caching, most proxies and browsers guess how long they consider an object as “live” or “valid”. Do not rely on these, control it yourself! Add the headers.
  • CQ doesn’t add any caching header by itself.
  • An very easy way to add HTTP caching headers is to configure your webserver to add them (for Apache: mod_expires is quite easy to use). Then every time your webserver delivers a object through the dispatcher (either by fetching it from CQ or by retrieving from cache) it will add these headers.