Disk usage

Providing enough free disk space is one of the most urgent tasks for every system administrator. Applications often behave strangely when a disk runs out of free space, and sometimes data even gets corrupted if the application is badly written.

Under Unix the two tools to determine disk usage are “du” (disk usage) and “df” (disk free). With “du” you can determine the amount of space a certain directory (with the files and subdirectories in it) consumes. “df” shows, on a global level, the amount of used and free disk space per partition.
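
A quick illustration (the path is just an example mountpoint, not a fixed location):

du -sh /data/cq    # sum of the sizes of all files below this directory, human-readable
df -h /data/cq     # used and free space of the partition this path lives on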

Assume you give the same directory (which is also a mountpoint) to both “du” and “df” as a parameter, and this directory contains a full CQ application with content and versions. You will probably get different results. When we did it, “df” showed about 570 gigabytes of used disk space, but “du” claimed that only 480 gigabytes were used. We checked for hidden directories, open files, snapshots and other things, but the difference of about 90 gigabytes remained.

This phenomenon can be explained quite easily. “du” accumulates the sizes of the files: if a file is 120 bytes in size, it adds 120 to the sum. “df” behaves differently: it counts block-wise, a block being the smallest allocation unit of a Unix filesystem (often 512 bytes by default, depending on the filesystem). Because a block can hold data of only one file, the 120-byte file occupies a full block, leaving 392 bytes of that block unused.
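
A small experiment makes the rounding visible; the file name is arbitrary, and the allocated size you see depends on your filesystem’s block size:

dd if=/dev/zero of=tiny.txt bs=1 count=120   # create a 120-byte file
ls -l tiny.txt                               # logical size: 120 bytes
du -k tiny.txt                               # allocated size: at least one full block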

Usually this behaviour is not apparent, because the number of files is rather small (a few thousand to some tens of thousands) and the files are large, so the unused parts amount to at most about one percent of the total file size. But if you have a CQ contentbus with several hundred thousand files (content plus versions), many of them small, this overhead can grow to a level where you’d ask your system administrator where the storage has gone.

So, dear system administrator, there’s no need to move your private backup off the server; just tell the business that their unique content structure needs a lot of additional disk space. 🙂

truss debugging

truss (on Linux: strace) is a really handy tool. According to its manual page:

The truss utility executes the specified command and produces a trace of the system calls it performs, the signals it receives, and the machine faults it incurs.

So it helps to be a bit familiar with system calls, which are made from a userspace program to the kernel of your operating system. OK, that sounds a little scary, but it is very helpful when you have a program which doesn’t work as expected but doesn’t provide any useful information (even with an increased loglevel). truss outputs the system call, its parameters (usually truncated to about 20-30 characters, so a call fits on a single output line) and the return value of the system call.

Of course you don’t see the internal processing of a process (e.g. pure calculations don’t need any kernel functions), but very often in operations the problem isn’t the internal processing, but missing files, files in the wrong directories, missing access rights and so on. These things cause problems quite often, and that’s when an experienced Unix administrator reaches for truss. Most of the time the administrator just focuses on the system calls which do I/O on the filesystem, as in the example below.
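
For example, on Linux you can attach to a running process and restrict the output to file-related system calls (the PID is a placeholder, of course):

strace -f -p 12345 -e trace=file   # follow forks, show only calls that take a filename

The Solaris equivalent is roughly:

truss -f -p 12345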

I use this tool quite often and I’m surprised how many problems I have found and resolved just by looking at the input and output system calls. Recently we had a hanging authoring cluster which didn’t respond any more, even after restarting both nodes. The logs showed nothing obvious; only truss revealed that there was a “lock” file, probably a left-over from an earlier operation (it must have been a race condition in the code, which we haven’t found yet). The cluster sync of both processes waited for this lock file to disappear so they could write themselves, and of course it never did, so the processes hung. truss showed me that these processes were repeatedly testing the existence of this file. I removed it and voilà, the cluster sync went back to operation.
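
In a case like this the pattern typically shows up as the same existence check being repeated over and over, so a simple filter on the trace output is enough to spot it (again, the PID is a placeholder):

truss -f -p 12345 2>&1 | grep -i lock   # does the process poll a lock file again and again?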

Using truss has worked so often that in most cases where I/O is involved, I use truss first, even before I increase the loglevel. Colleagues follow the same procedure when they have problems with their applications. We use the term “truss debugging” 😉

Oh, and if you have performance issues with I/O, truss is also a nice tool. You can check on which files I/O is performed and with what read/write ratio. You can even see what buffer size is used. About 3 years ago we filed a bug on CQ 3.5.5 because CQ read its default.map byte by byte; Day provided a hotfix and the startup time went down from 1 hour to about 15 minutes. The solution was to read the default.map in 2k blocks (or 4k, it doesn’t matter as long as it isn’t byte-wise); this issue was only noticeable by either reading the source code or invoking truss.
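
On Linux, a byte-wise read pattern like that is easy to spot: limit the trace to read() and write() and look at the requested buffer sizes (the third argument), or let strace summarize the call counts when you detach (the PID is again a placeholder):

strace -p 12345 -e trace=read,write   # shows each read()/write() together with its buffer size
strace -c -p 12345                    # prints a per-syscall count/time summary on exit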

Tip: Compiling all templates

In some use cases it’s very useful to compile all templates without too much effort; for example, when you have updated the templates and need them all compiled before going online, or as a quick test to find out whether your JSP files compile properly. In that case you can use the following one-liner:

perl -ane '$templates{$F[2]}=$F[3]; END{foreach $k (keys %templates) { print "/usr/bin/curl -o /dev/null -s localhost:7042$templates{$k}.html\n";}}' < default.map | sh

This uses your default.map to find a valid handle for each template and constructs a curl call, which is then executed by the shell. I haven’t yet checked the CRX structures for an equivalent of the default.map to make it also work on CRX-based installations.

Visualize your requests

In the last year, customers often complained about our bad performance. We had just fixed a small memory leak (which had crashed our publishing instances about every hour or so), so we were quite interested in getting reliable data to confirm or refute their complaints. Back then I realised that we needed a way to get a quick overview of the performance of our CQ instances. One look to see: “OK, it must be the network, our systems perform great!”

So I dug out my Perl know-how and wrote a little script which parses a request.log and prints out data that is understood by gnuplot, and gnuplot then draws some nice graphs from it. It displays the number of requests per minute and also the average request duration for these requests.
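
The gnuplot side of such a setup is straightforward. A minimal sketch, assuming the log has already been aggregated into lines of the form “HH:MM requests avg-milliseconds” in a file called requests.dat (the file name and column layout are assumptions for illustration, not the actual output format of my script):

gnuplot <<'EOF'
set xdata time
set timefmt "%H:%M"
set format x "%H:%M"
set terminal png
set output "requests.png"
set ylabel "requests per minute"
set y2label "average duration [ms]"
set y2tics
plot "requests.dat" using 1:2 with lines title "requests/min", \
     "requests.dat" using 1:3 axes x1y2 with lines title "avg duration"
EOF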

[Image: request-graph-all. Click on the image for a larger version.]

These images have proved pretty useful: you can show them to your manager (“Look, the average response time went down from 800 milliseconds to 600 although the number of requests went up by 30%.”) and they help in daily business, because you can spot problems quite well. When the response times go up at a certain point in time, you had better take a look at the system and find the reason for it.

[Image: request-graph-html-uk. Click on the image to view a larger version.]

Because this script is quite fast (it parses 300 megabytes of request.log in about 15 seconds on a fast Opteron-based machine), we usually render these images online and integrate them into a small web application (no CQ, just a small hacked-up PHP script). For some more interactivity I added the possibility to display only the requests which match a certain string. So it’s very easy to answer questions such as “Is the performance of my landing page as bad as customers report?”

You can download this little Perl script here. Run it with “--help” first and it will display a little help screen. Give it a number of request.log files as parameters, pipe the output directly into gnuplot (I tested with version 4.0, but it will probably also work with newer versions) and it will output a PNG file. Adjust the script to your needs and contribute back; I released it under GPL version 2.
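
An invocation could then look roughly like this (the script name is just a placeholder for whatever you save the download as; where exactly the PNG ends up depends on the gnuplot commands the script emits, so check its --help output):

perl request-graph.pl request.log request.log.0 | gnuplot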

(For the hackers: some things can probably be done better, and I also have some new functionality already prepared in it, but not yet active. Patches are welcome :-))

Tip: Lock out the users

From time to time you need to perform maintenance work on your systems; on highly configurable systems like Communique these tasks can often be performed online while the authors are working. But some tasks require that there’s no activity on the system: when you want to reorganize your replication agents, it’s a very good feeling when nobody is trying to activate the super-duper important company news. So locking out the users is required.

A very easy method is available when only the superuser should have access. Because ACLs do not apply to the superuser, just add a “DENY all” ACL for all users; the easiest way to do this is to add this ACL to the “post” user. Then the last ACL evaluated for every user is a DENY on all handles. This effectively forbids logins and displaying any content for all users but the superuser.

Everything is content

This is the philosophy behind Communique: everything is content. Templates are content, user data is content, user ACLs are content and, of course, content is content. Interestingly, the compiled JSPs are also content, so you can remove them easily with the installation of a single package and force the JVM to recompile them.

If all these elements can be handled the same way, you can use a single mechanism to install, update and remove them. The CQ package tool is a great thing for deploying new template code (Java code) plus the additional ESP files and other static image snippets, and it lets you access parts of the system which are not reachable through the authoring GUI. But behind the scenes it looks quite different:

  • Content (the things you have Communique for) is versioned and can be added and removed online.
  • Code isn’t versioned. If the installation of a package were an atomic operation, one could easily switch between different template versions: going back to an older template version would just mean undoing the template installation and restoring the version which was live before. Sadly that’s not possible; in practice one solves this by cleaning out all template locations and re-installing the older template version.
  • Day hotfixes and services: a weird construct. Because you cannot exchange them at runtime, they are extracted into a directory system/bin.new; on restart the content of system/bin is zipped into a file system/bin.$TIMESTAMP.zip and then the content of system/bin.new is copied over to system/bin. A stable mechanism (updates are performed before the CQ core and the services actually start), but it’s really hard to undo hotfixes. There is no GUI for this any more; you need to find the right zip file and manually unzip it to system/bin, as sketched below.
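
Undoing a hotfix then boils down to restoring the previous state of system/bin from one of those backup zips. A rough sketch, assuming an installation directory of /opt/cq and an example timestamp (stop the instance before doing this, and check the archive paths first):

cd /opt/cq/system                      # assumed CQ installation directory
ls bin.*.zip                           # find the backup created before the hotfix
unzip -l bin.20090101-1200.zip         # check how the paths are stored in the archive
unzip -o bin.20090101-1200.zip -d bin  # restore the old content into system/bin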

Oh, another thing: older versions of a handle are not content. There is no way to create a package of handles including their history (i.e. the versions); only the most recent version of each handle is included.