CQ and multi-client capability

In large companies there’s a need for a WCMS like Day CQ in many areas: of course for the company website, but also for divisions offering services to internal or external customers, which either implement their applications on top of Day CQ, use CQ as a proxy for other backend systems, or just need a system for providing online help.

Some of these systems are rather small, but nevertheless the business needs a system where authors can create and modify content according to the business’ needs. If you already have CQ plus the knowledge to build and operate applications on it, it’s quite natural to use it to satisfy these needs. The problem is that building up and operating a CQ environment isn’t cheap for only one application. So very often the question comes up whether CQ is capable of hosting several applications within one CQ installation, to leverage economies of scale. CQ is able to handle hundreds of thousands of content handles per instance, so it should be able to host 5 applications with 20,000 handles each, shouldn’t it?

Well, that’s a good question. By design CQ is able to host multiple applications, enforce content separation via the ACL concept, and limit users’ access to templates. Sounds good. But — as always — there are problematic details.

I will cover these problems and ways to resolve them in the next posts.

truss debugging

truss (on Linux: strace) is a really handy tool. According to its manual page:

The truss utility executes the specified command and produces a trace of the system calls it performs, the signals it receives, and the machine faults it incurs.

So it’s a good thing to be a bit familiar with system calls, the calls a userspace program makes to the kernel of your operating system. Ok, that sounds a little scary, but it is very helpful when you have a program which doesn’t work as expected but doesn’t provide any useful information (even with an increased loglevel). truss prints each system call, its parameters (usually truncated to about 20-30 characters, so they fit on a single output line) and the return value of the system call.

Of course you don’t see the internal processing of a process (e.g. for pure calculations there’s no need to use kernel functions), but very often in operations the problem isn’t the internal processing, but missing files, files in wrong directories, missing access rights and so on. These things cause problems quite often, and that’s when an experienced Unix administrator reaches for truss. Most of the time the administrator just focuses on the few system calls which do I/O on the filesystem.
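
For the typical case of watching the file I/O of a running process, an invocation along these lines is usually all you need (the PID is of course just an example):

truss -f -t open,close,stat,read,write -p 4711
(on Linux: strace -f -e trace=open,close,stat,read,write -p 4711)

-f follows forked child processes, -t (or -e trace= for strace) restricts the output to the listed system calls, and -p attaches to an already running process.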

I use this tool quite often and I’m surprised how many problems I have found and resolved just by looking at the input and output system calls. Recently we had a hanging authoring cluster which didn’t respond anymore, even after we had restarted both nodes. The logs showed nothing obvious; only truss revealed that there was a “lock” file, which probably was a leftover from an earlier operation (it must have been a race condition in the code, which we haven’t found yet). The cluster sync of both processes waited for this lock file to disappear so that they could write themselves; of course it didn’t disappear, so the processes hung. truss showed me that these processes were repeatedly testing the existence of this file. I removed it and voilà, the cluster sync went back to operation.

Using truss has worked so often that in most cases where I/O is involved, I use truss first, even before I increase the loglevel. Colleagues follow the same procedure when they have problems with their applications. We use the term “truss debugging” 😉

Oh, and if you have performance issues with I/O, truss is also a nice tool. You can check on which files I/O is performed, and with what read/write ratio. You can even see what buffer size is used. About 3 years ago we filed a bug on CQ 3.5.5 because CQ read its default.map byte by byte; Day provided a hotfix and the startup time went down from 1 hour to about 15 minutes. The solution was to read the default.map in 2k blocks (or 4k, it doesn’t matter as long as it isn’t byte-wise); this issue was only noticeable by either reading the source code or invoking truss.
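
Schematically, the truss output of that bug was an endless repetition like this (file descriptor and buffer contents are made up for illustration); the third parameter of read is the buffer size, so the problem is visible at a glance:

read(34, " <", 1)        = 1
read(34, " m", 1)        = 1
read(34, " a", 1)        = 1
...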

Is CRX 1.4.2 production ready?

The contentbus technology was the standard storage backend up to the CQ 4.x series. Although file-based storage wasn’t state of the art even in the late 1990s (MySQL had already been invented, PostgreSQL existed, plus at least half a dozen enterprise database systems), Day chose to store the content objects in individual files, hidden behind an abstraction layer. Of course it took some tuning and experience, but the contentbus proved to be a reliable storage which had the big advantage that with an editor on the filesystem you can solve nearly all problems (we used sed more than once to fix our default.map).

But some points were still open:

  • Online backup isn’t possible. The documentation simply states: “Shut down the CQ, copy your files, and start it up again”. You can speed up the copy by replacing it with a snapshot on the filesystem layer, but this need for a restart doesn’t make it enterprise-ready. (Databases have offered online backup for at least a decade.)
  • Contentbus requires a lot of filesystem I/O (mainly the system calls open and close). Having a lot of these operations slows down the processing; a small number of larger files would reduce this administrative overhead in the filesystem.
  • Memory usage of contentbus artifacts: some artifacts like the default.map and zombie.map have in-memory data structures which grow as the underlying files grow (or vice versa). The more content you have, the more memory is used, even if only a small part of this content is in active use. This doesn’t scale well.
  • The contentbus offers cluster support, but only with 2 nodes; with more nodes the overall performance even degrades! According to the cluster documentation for CRX 1.4, Day tested CRX in a clustered setup with 6 nodes. If the performance loss is acceptable (that means, 6 nodes offer more performance than 5 nodes), this would be a really good way to scale your authoring systems.

So we decided that it’s time to evaluate whether CRX is at least as good as the contentbus. The TAR persistence manager mainly addresses the backup issue; we hope to get some performance improvements as well.

So currently I’m doing a test setup of CQ 4.2.0 and CRX 1.4.2, for which Day offered (just in time :-)) a knowledge base article.
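
For reference, the persistence manager of CRX is configured per workspace in its workspace.xml; the relevant element boils down to a single line like the following (the class name is the TAR persistence manager shipped with CRX 1.4; treat any further parameters as assumptions):

<PersistenceManager class="com.day.crx.persistence.tar.TarPersistenceManager"/>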

Take care of your selectors!

Recently I showed two scenarios where selectors can be used as a way to cache several different views of a single page. This quite often allows you to avoid HTTP parameters, reducing the load on your machines and speeding up your website.

Let’s assume that you have the already mentioned handle /etc/medialibrary/trafficjam.html and your templates support displaying the image in 3 different sizes, “preview”, “big” and “original”. So what happens if somebody chooses to request the URL “/etc/medialibrary/trafficjam.tiny.html”?

I checked some CQ-based websites and tested this behaviour by just adding a dummy selector. In most cases you get a proper page rendered, looking the same as without the selector. So most templates (and also template developers) ignore selectors the specific template isn’t expected to handle. That’s good, isn’t it?

Well, in combination with the dispatcher cache it isn’t good. Because the dispatcher caches everything which is returned with an HTTP status code 200 from CQ, just adding a “foo” selector will place another copy of the page in the dispatcher cache. The same happens with a “foo1” selector, and so on. In the end the disk is full and the dispatcher cannot write any more files to it, but will forward every request to your CQ.

So, how can you avoid this problem? As said, the dispatcher only caches when it receives an HTTP status code 200. So you need to add some code to your templates which always checks the selectors. If a specific template doesn’t support any selector but is called with one, don’t return a status code 200, but a 301/302 redirect to the same page without any selectors, or just a plain 404 (“not found”); because calling this page with selectors isn’t a valid action and should never happen, such a status code is ok. The same applies when the template supports a limited set of selectors (“preview”, “big” and “original” in the example above): treat them as a whitelist, and if the given selector doesn’t match, return the redirect or the 404 code, as in the sketch below.
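
A minimal sketch of such a whitelist check at the top of a JSP template; it parses the selectors out of the request URI itself instead of relying on any CQ-specific API, and the allowed selectors are the ones from the example:

<%
    // everything between the page name and the extension is a selector;
    // any selector outside the whitelist is answered with a 404
    String uri = request.getRequestURI();   // e.g. /etc/medialibrary/trafficjam.big.html
    String file = uri.substring(uri.lastIndexOf('/') + 1);
    String[] parts = file.split("\\.");
    java.util.List allowed = java.util.Arrays.asList(
        new String[] { "preview", "big", "original" });
    for (int i = 1; i < parts.length - 1; i++) {
        if (!allowed.contains(parts[i])) {
            response.sendError(404);        // or send a redirect instead
            return;
        }
    }
%>

On a request without any selectors the loop body is never executed, so regular requests are not affected.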

This way you don’t pollute your cache and still have the flexibility to use selectors. I think this will outweigh the cost of adjusting your templates.

Permission sensitive caching

In the latest versions of the dispatcher (starting with the 4.0.1 release), Day added a very interesting feature which allows you to cache content on the dispatcher level even when it is not public.
Honwai Wong of the Day support team explained it very well at the TechSummit 2008. I was a bit surprised, but I even found it on Slideshare (the first half of the presentation).
Honwai explains the benefits quite well. From my experience you can reduce the load on your CQ publishers (trading a request which requires the rendering of a whole page for a request which just checks the ACLs of a page).

If you want to use this feature, you have to make sure that for every group or user who is to get an individual page, the dispatcher delivers the right one. Imagine you want to present the latest company news to logged-in users, while not-logged-in users shouldn’t get them, and only the managers get the link to the latest financial data on the start page. So you need a start page for 3 different groups (not-logged-in users, logged-in users, managers), and the system should deliver the appropriate one. Having a single home.html isn’t enough; you need to distinguish.

The easiest way (and the Day way ;-)) is to use a selector denoting the group the user belongs to, so home.group-logged_in.html or home.managers.html would be good. If no selector is given, we assume the user is anonymous. You have to configure the linkchecker to rewrite all links so that they contain the correct selector. So if a user belongs to the logged_in group and requests the home.logged_in.html page, the dispatcher asks the CQ: “the user has the following HTTP header lines and is requesting home.logged_in.html, is that ok?”. CQ then checks whether the given HTTP header lines belong to a user of the group logged_in; because they do, it responds with “200 OK, just go on”, and the dispatcher delivers the cached file, so there’s no need for the CQ to render the same page again and again. If the user doesn’t belong to that group, CQ detects that and sends a “403 Permission denied”, and the dispatcher forwards this answer to the user. If a user is a member of more than one group, having multiple “group-” selectors is perfectly valid.
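
On the dispatcher side this feature is driven by an /auth_checker section in the dispatcher.any; a sketch along the lines of the later official documentation (the servlet path and the filter globs are pure assumptions, and since no documentation exists yet for the 4.0.x releases, the exact syntax may differ):

/auth_checker
  {
  # servlet on the CQ instance which answers 200 (deliver from cache) or 403
  /url "/bin/permissioncheck"
  # pages which are subject to the permission check
  /filter
    {
    /0000 { /glob "*" /type "deny" }
    /0001 { /glob "/content/secure/*.html" /type "allow" }
    }
  }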

Please note: I speak of groups, not of individual users. I don’t think this feature is useful when each user requires a personalized page: the cache-hit ratio is pretty low (especially if you include often-changing content, e.g. daily news or the content of an RSS feed) and the disk consumption would be huge. If a single page is 20k and you have a version cached for 1000 users, you have a disk usage of 20 MB for a single page! And don’t forget the performance impact of a directory filled with thousands of files. If you want to personalize pages for individual users, caching is inappropriate. Of course the usual nasty hacks are applicable, like requesting the user-specific data via an AJAX call and then modifying the page in the browser using Javascript.

Another note: Currently no documentation is available on permission sensitive caching, only the presentation by Honwai Wong linked above.

CQ Dispatcher 4.0.3 available

A few days ago Day released a new version of the Day CQ dispatcher plugin. One of the most important topics in this release is the number of supported platforms: the dispatcher now ships as an Apache module for Solaris Sparc and x86, both for Apache 2.0 and Apache 2.2. Finally the most relevant platforms are supported for Apache 2.0 and 2.2 (the exception to this rule is AIX). The release also contains a few bugfixes for the Windows and Mac OS X platforms and, once again, a fix for permission sensitive caching.

Permission sensitive caching will be the next topic here, so stay tuned.

Creating cachable content using selectors

The major difference between a static object and a dynamically created object is that the static one can be stored in caches: the content it contains does not depend on user or login data, date or other parameters, and it looks the same on every request. So caching it is a good idea to move load off the origin system and to accelerate the request-response cycle.

A dynamically created object is influenced by certain parameters (usually username/login, permissions, date/time, but there are countless others) and therefore its content may differ from request to request. These parameters are usually specified as query parameters, and such responses must not be cached (see the HTTP 1.1 specification in RFC 2616).

But sometimes it would be great if we could combine these 2 approaches. For example, you want to offer images in 3 resolutions: small (as a preview image, e.g. in a folder view), big (full-screen view) and original (the full resolution delivered by the picture-taking device). If you decide to deliver them as static objects, they are cachable, but you need 3 names, one for each resolution. This blurs the fact that these 3 images are the same and differ only in resolution; it creates 3 images instead of one image in 3 instances. A more practical drawback is that you always have to precompute these 3 pictures and place them in a reachable location. Lazy generation is hard as well.

If you choose the dynamic approach, the image would be available as one object whose instances can be created dynamically. The drawback here is that it cannot be cached.

Day Communique has a feature (which the guys at Day also ported to Apache Sling): the so-called selectors. They behave like the query parameters one has used since the stone age of the HTTP/HTML era, but they are not query parameters; they are encoded in the static part of the URL. So the query part of the URL (as of HTTP 1.1) is no longer needed.

So you can use the URLs /etc/medialib/trafficjam.preview.jpg, /etc/medialib/trafficjam.big.jpg and /etc/medialib/trafficjam.original.jpg to address the image in the 3 required resolutions. If your dispatcher doesn’t find them in its cache, it forwards the request to your CQ, which can then scale the requested image on demand; afterwards the dispatcher can store the image and deliver it from its cache. That’s a very simple and efficient way to make dynamic objects static and offload requests from your application servers.
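
Just to illustrate what “scale on demand, selected by selector” boils down to on the server side, here is a minimal, CQ-independent sketch in plain Java (the selector names follow the example; the pixel sizes are made-up assumptions, and in CQ this logic would of course live in the script handling the jpg requests rather than in a standalone class):

import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class SelectorScaler {

    /** Map a selector to a maximum edge length in pixels; -1 keeps the original size. */
    static int maxEdgeFor(String selector) {
        if ("preview".equals(selector))  return 200;   // assumed preview size
        if ("big".equals(selector))      return 1024;  // assumed full-screen size
        if ("original".equals(selector)) return -1;
        return -2; // unknown selector: a template would answer this with a 404
    }

    /** Scale so that the longer edge is at most maxEdge, keeping the aspect ratio. */
    static BufferedImage scale(BufferedImage src, int maxEdge) {
        double factor = (double) maxEdge / Math.max(src.getWidth(), src.getHeight());
        if (factor >= 1.0) return src; // never enlarge
        int w = (int) (src.getWidth() * factor);
        int h = (int) (src.getHeight() * factor);
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }

    // usage: java SelectorScaler trafficjam.jpg preview out.jpg
    public static void main(String[] args) throws IOException {
        int maxEdge = maxEdgeFor(args[1]);
        if (maxEdge == -2) {
            System.err.println("unknown selector: " + args[1]);
            System.exit(1);
        }
        BufferedImage src = ImageIO.read(new File(args[0]));
        ImageIO.write(maxEdge == -1 ? src : scale(src, maxEdge), "jpg", new File(args[2]));
    }
}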

Tip: Compiling all templates

In some use cases it’s very useful to compile all templates without too much effort: for example when you have updated the templates and need all of them compiled before going online, or as a short test to find out whether your JSP files compile properly. In that case you can use the following one-liner:

perl -ane '$templates{$F[2]}=$F[3]; END{foreach $k (keys %templates) { print "/usr/bin/curl -o /dev/null -s localhost:7042$templates{$k}.html\n";}}' < default.map | sh

This uses your default.map to find a valid handle for each template and constructs a curl call, which is then executed by the shell. I haven’t yet checked the CRX structures for an equivalent of the default.map to make this work on CRX-based installations as well.

Joerg’s rules for loadtests

In the article “Everything is content (part 2)” I discussed the problems of doing proper loadtests with CQ, given that your CQ gets (a bit) slower with every loadtest. In a comment Jan Kuźniak proposed disabling versioning and restoring your loadtest environment for every loadtest. Thinking about Jan’s contribution revealed a number of topics I consider crucial for loadtests. I collected some of them and would like to share them.

  • Provide a reasonable amount of data in the system. This amount should be roughly equal to your production system, so the numbers are comparable. Being 20% off doesn’t matter, but don’t expect good results if your loadtest runs on 1,000 handles while your production system heads towards 50,000 handles. You may optimize the wrong parts of your code then.
    If you benchmarked a speedup of 20% in the loadtests but got nothing on the production system, you have already seen this effect.
  • When your loadtest environment is ready to run, create a backup of it. Drop the CQ loadtest installation(s) from time to time, restore it from the backup and re-run your loadtest on a clean installation to verify your results. That’s the point I already mentioned.
  • Always have the same configuration in the production and loadtest environments. That’s the reason why I disagree with disabling versioning on the loadtest environment. The effect of a diverging configuration may be the same as in the point above: you may optimize the wrong parts of your code.
  • No error messages during the loadtest. If an error message indicates a code problem, it’s probably reproducible by re-running the loadtest (come on, reproducible bugs are the easiest ones to fix :-)). If it’s a content problem, you should adjust your content. A loadtest is also a very basic regression test, so take its results (errors belong to them as well!) seriously.
  • Be aware of resource virtualization! Today’s hype is to run as many applications as possible in virtualized environments (VMware, KVM, Solaris zones, LPARs, …) to increase the efficiency of the hardware usage and lower costs. Doing so very often removes some guarantees you need for comparing the results of different loadtests. For example, in one loadtest you have 4 CPUs available, while in the second one you have 6. Are the results comparable? Maybe they are, maybe not.
    Being limited to the same 4 CPUs every time makes loadtests comparable, but if your production system requires 8 CPUs, you cannot load your loadtest system with production-level numbers. Getting a decent loadtest environment is a hard job …
  • Have good test scenarios. Of course the most basic requirement. Don’t just grab the access.log and throw it at your load injector: re-running GET requests is easy, but forget about POSTs. Modelling good scenarios is hard and takes a lot of time.

Of course there are a lot more things to consider, but I will limit myself to these points for the moment. Eventually there will be a part 2.

META: Being linked from Day

Today this blog was presented as link of the day on dev.day.com. Thanks for the kudos 🙂

Welcome to all who are reading this blog for the first time. I’ve collected some experiences with Day CQ here and will continue to share my experience and thoughts. Don’t hesitate to comment or drop me an email if you have a question about a post.

Please note: I am not a developer, so I cannot help you with specific questions about template programming. In that case you can contact one of the groups mentioned on the Day Developer website.