Creating cacheable content using selectors

The major difference between a static object and a dynamically created one is that static objects can be stored in caches; the content they contain does not depend on user or login data, date or other parameters. They look the same on every request. So caching them is a good idea to move load off the origin system and to accelerate the request-response cycle.

A dynamically created object is influenced by certain parameters (usually username/login, permissions, date/time, but there are countless others) and therefore its content may differ from request to request. These parameters are usually specified as query parameters, and such responses must not be cached (see the HTTP 1.1 specification in RFC 2616).

But sometimes it would be great if we could combine these 2 approaches. For example, you want to offer images in 3 resolutions: small (as a preview image, e.g. in folder view), big (full-screen view) and original (the full resolution delivered by the picture-taking device). If you decide to deliver them as static objects, they are cacheable. But then you need 3 names, one for each resolution. Choosing this blurs the fact that these 3 images are the same and differ only in resolution. It creates 3 images instead of having only one in 3 instances. A more practical drawback is that you always have to precompute these 3 pictures and place them in a reachable location. Lazy generation is hard, too.

If you choose the dynamic approach, the image would be available as one object whose instances can be created dynamically. The drawback here is that it cannot be cached.

Day Communique has the feature (the guys at Day ported it to Apache Sling as well) of so-called selectors. They behave like the query parameters one has used since the stone age of the HTTP/HTML era. But they are not query parameters; they are encoded in the static part of the URL. So the query part of the URL (as of HTTP 1.1) is no longer needed.

So you can use the URLs /etc/medialib/trafficjam.preview.jpg, /etc/medialib/trafficjam.big.jpg and /etc/medialib/trafficjam.original.jpg to address the image in the 3 required resolutions. If your dispatcher doesn't find them in its cache, it will forward the request to your CQ, which can then scale the requested image on demand. The dispatcher can then store the image and deliver it from its cache. That's a very simple and efficient way to make dynamic objects static and offload requests from your application servers.
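To make the mechanics concrete, here is a minimal sketch of how such a URL can be decomposed into resource, selectors and extension. This is an illustrative Python model only, not the actual Sling or Communique implementation (the function name is my own):

```python
# Minimal sketch of selector-style URL parsing, loosely modeled on
# how Sling decomposes a request path. Illustrative only.

def parse_selectors(path):
    """Split '/etc/medialib/trafficjam.preview.jpg' into
    (resource, selectors, extension)."""
    directory, _, filename = path.rpartition("/")
    parts = filename.split(".")
    resource = directory + "/" + parts[0]
    extension = parts[-1] if len(parts) > 1 else ""
    selectors = parts[1:-1]  # everything between name and extension
    return resource, selectors, extension

resource, selectors, ext = parse_selectors("/etc/medialib/trafficjam.preview.jpg")
# resource == "/etc/medialib/trafficjam", selectors == ["preview"], ext == "jpg"
```

All three URLs above resolve to the same resource; only the selector list differs, which is exactly what makes the "one image, three instances" idea work.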

META: Being linked from Day

Today this blog was presented as link of the day on dev.day.com. Thanks for the kudos 🙂

Welcome to all who read this blog for the first time. I’ve collected some experiences with Day CQ here and will share my experience and other thoughts further on. Don’t hesitate to comment or drop me an email if you have a question about a post.

Please note: I am not a developer, so I cannot help you with specific questions about template programming. In that case you can contact one of the groups mentioned on the Day Developer website.

Everything is content (part 2)

Recently I pointed out some differences in the handling of the “everything is content” paradigm of Communique. A few days ago I found a posting by David Nüscheler over at dev.day.com, in which he explained the details of performance tuning.

(His 5 rules do not apply exclusively to Day Communique, but to every performance tuning session).

In Rule 2 he states:

Try to implement an agile validation process in the optimization phase rather than a heavy-weight full blow testing after each iteration. This largely means that the developer implementing the optimization has a quick way to tell if the optimization actually helped reach the goal.

In my experience this isn’t viable in many cases. Of course the developer can quickly check if his new algorithm performs better (= is faster) than the old one. But in many cases the developer doesn’t have all the resources and infrastructure available and doesn’t have all the content in his test system; which is the central point why I do not trust tests performed on developer systems. So the project team relies on central environments which are built for load testing, which have loadbalancers, access to directories, production-size machines and content which is comparable to the production system. Once the code is deployed, you can do load testing, either using commercial software or just something like JMeter. If you use continuous integration and an autodeployment system, you can run such tests every day.

Ok, where did I start? Right, “everything is content”. So you run your loadtest. You create handles, modify them, activate and drop them, you request a page to view, you perform activities on your site, and so on. Afterwards you look at your results and hopefully they are better than before. Ok. But …

But Communique is not built to forget data (of course it does sometimes, but that’s not the point here :-) ), so all these activities are stored. Just take a look at the default.map, the zombie.map, the cmgr.hist file, … So all your recent actions are persisted and CQ knows of them.

Of course handling more of such information doesn’t make CQ faster. In case you have long periods of time between your template updates: check the performance data directly after a template update and compare them to the ones a few months later (assuming you don’t have CQ instances which aren’t used at all). You will see a decrease in performance; it can be small and nearly unmeasurable, but it is there. Some actions are slower.

Ok, back to our loadtest. If you run the loadtest again and again and again, a lot of actions are persisted. When you reproduce the code and the settings of the very first loadtest and run it again on a system which has already faced 100 loadtests, you will see a difference. The result of this 101st loadtest is different from the first one, although the code, the settings and the loadtest are essentially the same. All the same except CQ and its memory (default.map and friends).

So you need a mechanism which allows you to undo all changes made by such a loadtest. Only then can you perfectly reproduce every loadtest and run them 10 times without any difference in the results. I’ll try to cover such methods (there are some of them, but not all equally suitable) in some of the next posts.

And to get back to the title: everything is content, even the history. So contrary to my older posting, where I said:

Older versions of a handle are not content.

They are indeed content, but only when it comes to slowing down the system 🙂

Everything is content

This is the philosophy behind Communique. Everything is content. Templates are content, user data are content, user ACLs are content and — of course — content is content. Interestingly, the compiled JSPs are also content, so you can remove them easily with the installation of a single package and force the JVM to recompile them.

If all these elements can be handled the same way, you can use a single mechanism to install, update and also remove them. The CQ package tool is a great way to deploy new template code (Java code) plus the additional ESP files and other static image snippets. You can access parts of the system which are not reachable via the authoring GUI. But behind the scenes it looks quite different:

  • Content (the things you have Communique for) is versioned and can be added and removed online.
  • Code isn’t versioned. If the installation of a package were an atomic operation, one could easily switch between different template versions. Going back to an older template version would be quite easy: just undo the template installation and restore the version which was live then. Sadly this isn’t possible; one solves it by cleaning all template places and re-installing the older template version.
  • Day hotfixes and services: a weird construct. Because you cannot exchange them at runtime, these are extracted into a directory system/bin.new; on restart the content of system/bin is zipped into a file system/bin.$TIMESTAMP.zip and then the content of system/bin.new is copied over to system/bin. A stable mechanism (updates are performed before the CQ core and the services actually start), but it’s really hard to undo hotfixes. No GUI anymore; you need to find the right zip file and manually unzip it to system/bin.

Oh, another thing: older versions of a handle are not content. There is no way to create a package of handles including their history (aka the versions). Only the most recent versions are included.

Create cacheable content

In the last posts I described that you should use caching to increase performance. But caching works only for static content. Most websites aren’t built of static pages; they are highly dynamic, presenting content which can be different for every user visiting the page.

So, the first and easiest candidates for caching are images and media elements. Cache them! Every request not hitting your CQ instance is a good request. Your webservers are faster than your CQ instances and scale much better. So configure the dispatcher to cache all files with the extensions “.jpg”, “.png” and “.pdf” (and probably some more depending on your content), always keeping in mind that pages with such an extension must not have different content depending on request data!
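In the dispatcher configuration this could look roughly like the following fragment of the cache rules section (docroot path and rule numbers are illustrative; adapt them to your own dispatcher.any):

```
/cache
  {
  /docroot "/var/www/html"
  /rules
    {
    # deny everything by default, then allow the static media types
    /0000 { /glob "*" /type "deny" }
    /0001 { /glob "*.jpg" /type "allow" }
    /0002 { /glob "*.png" /type "allow" }
    /0003 { /glob "*.pdf" /type "allow" }
    }
  }
```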

The harder part is the content pages, which may change and must be changed based on request data. Usually you cannot cache them because you don’t want to present a cached personalized page to all users requesting that URL. But some optimizations are possible:

  • If the number of valid parameter values is rather small, you can implement them as selectors. Then the page is cacheable again, but think of the cache size. Don’t accept arbitrary selectors.
  • Reduce the effort the system needs to render your dynamic page. If only a small part is based on previous input, you can split the whole rendering process into 2 parts: the dynamic parts are rendered by CQ; this is then parsed by the SSI module of your webserver, which adds the static parts to the page. Voilà, only a small part is rendered directly by CQ; most parts can come out of the dispatcher cache.
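As a sketch of the SSI part: the cacheable page shell can embed the dynamic fragment via a server-side include, so only that fragment is requested from CQ on each hit. The paths here are made up, and mod_include must be enabled on the webserver (e.g. Options +Includes and an INCLUDES output filter in Apache):

```
<!-- Cached page shell (served from the dispatcher cache).
     Only the include target below is rendered by CQ per request.
     Paths are illustrative. -->
<html>
  <body>
    <h1>Product catalog</h1>
    <!--#include virtual="/bin/userinfo.personalized.html" -->
    <p>...lots of static markup, safely cacheable...</p>
  </body>
</html>
```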

In some later posting I will cover these 2 points more thoroughly.

Dispatcher caching and content structure

The dispatcher behaviour described before has one major drawback which may have implications for the design of your content structure.

Invalidation is performed on a whole directory and all its subtrees!

So if you invalidate a file A in a directory, file B in the same directory is also invalidated, as is file C in the subdirectory D. All these invalidated files need to be refetched from your CQ instance before caching works again. This happens although you invalidated only file A. Files B and C weren’t changed at all, but they are refetched! So a single invalidation can create a lot of load on your CQ instance(s).

But this may be a side effect you actually want. When you change handle A, it may also affect handles B and C (maybe you changed the title of handle A, which is required by handles B and C to render a correct navigation). This is the easiest way to invalidate all dependent handles when you change a single one.

So the point is to find a balance between permanently flushing your whole cache whenever you activate a single handle, and taking care of your dependencies so that files are marked invalid when an update happens.

And here comes the already mentioned parameter statfileslevel. Assuming a well-designed content structure, you set this parameter as high as possible to minimize the number of files which are invalidated when a single file is invalidated. On the other hand you should set it low enough that all files which depend on the invalidated file are also invalidated.
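To illustrate the trade-off, here is a toy Python model of the invalidation rule. It simplifies heavily: it assumes a cached file is compared against the .stat file at depth min(statfileslevel, depth of its directory), and that an invalidation touches the .stat files along the invalidated file’s path up to statfileslevel. The real dispatcher works on file modification times on disk; this is only meant to show how a higher statfileslevel shrinks the invalidated subtree:

```python
# Toy model of dispatcher invalidation vs. statfileslevel.
# Simplification of the real mtime-based .stat mechanism.

def is_invalidated(cached_path, invalidated_path, statfileslevel):
    """Is cached_path considered stale after invalidated_path
    has been invalidated?"""
    cdirs = cached_path.strip("/").split("/")[:-1]       # parent dirs
    idirs = invalidated_path.strip("/").split("/")[:-1]
    # The .stat consulted for the cached file sits at this depth:
    level = min(statfileslevel, len(cdirs))
    # That .stat is touched iff the invalidated path runs through
    # the same directory at that depth.
    return level <= len(idirs) and cdirs[:level] == idirs[:level]

# statfileslevel 0: one global .stat, everything is flushed.
is_invalidated("/content/other/x.html", "/content/site/news/a.html", 0)       # True
# statfileslevel 2: only the shared subtree /content/site is flushed.
is_invalidated("/content/site/press/b.html", "/content/site/news/a.html", 2)  # True
is_invalidated("/content/other/x.html", "/content/site/news/a.html", 2)       # False
```

In this model you can see both halves of the balance: a higher level keeps unrelated subtrees cached, but files which depend on the invalidated handle must still live under a touched directory to be flushed.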

Knowing this you should arrange your content as follows:

  • Group your content and use hierarchies. This allows you to increase your statfileslevel.
  • Minimize the number of dependencies between your handles so you can keep the statfileslevel high. Keep dependent handles in the same directory and try to avoid dependencies on handles in your parent directory (or even higher), so you don’t need to decrease the statfileslevel.

Note: The parameter statfileslevel has a global scope.