Is my CRX performance I/O bound?

Over the last few months I’ve seen many situations where CQ5 performance was poor while CPU usage was quite low (on a 16-core machine, 15 cores were idling). The next assumption, that I/O was the problem, wasn’t confirmed by tools like top, iostat or sar: they showed an I/O wait of 3-4%, which indicates that there is I/O, but that the system isn’t saturated by it.

Further investigation using thread dumps and a profiler showed that there was indeed an I/O problem: they revealed a lot of blocked threads within the JVM, all of which wanted to do I/O. So how does this fit together?

It can be explained quite easily. Your operating system is optimized to handle multiple parallel threads/processes which want to do I/O. If there are too many of them or the I/O subsystem is too slow, the I/O wait ratio will increase. But in the CRX case it’s a bit different. Because of its internal structures, CRX requires locking to synchronize its write actions to disk (reads are done in parallel). So for many operations (like updating metadata, writing the journal log etc.) only one thread can actually write to the filesystem; all others wait for this thread to finish its write action. To the operating system this looks like a single write action, which can be handled quite easily without the I/O wait skyrocketing.

So if you suspect that you have an I/O problem, use a profiler or thread dumps (or /crx/diagnostic/prof.jsp); the data displayed by top doesn’t tell the whole truth.
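One way to make such blocked threads visible is to count thread states across a few thread dumps. The following is only a minimal sketch: it assumes the usual HotSpot dump format with "java.lang.Thread.State:" lines, and the function name thread_state_summary is made up for this example.

```shell
#!/bin/sh
# Summarize thread states from a HotSpot thread dump read on stdin.
# Prints one "<count> <STATE>" line per state, most frequent first.
thread_state_summary() {
    awk '/java.lang.Thread.State:/ { print $2 }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Usage (assumes jstack from a JDK and the PID of the CQ5/CRX JVM):
#   jstack "$PID" | thread_state_summary
# Without jstack, "kill -QUIT $PID" makes the JVM print a dump to its stdout.
```

If BLOCKED dominates over several dumps taken a few seconds apart while the CPU is mostly idle, it is worth looking at what the blocked threads are waiting for.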

CRX 2.1: Improved backup

CRX 2.1 introduced no major changes, only smaller improvements in many places. In simple Geometrixx-based performance tests you can get a speedup of about 30% (at least that’s what I measured with CRX 2.1 and CRX Hotfix 2.1.0.5), just because of improvements in Jackrabbit and CRX code. So I would recommend going to CRX 2.1 for every CQ 5.3 project.

Next to these improvements, a small extension to the online backup was introduced. Up to and including CRX 2.0 you backed up to a zip file (which contains the complete quickstart and the repository). In CRX 2.1 it is no longer required to zip the backed-up data; the backup procedure can simply drop it into a directory. This directory still contains all data (quickstart and the repository), so you can run the zip program yourself if you want. So, what’s the big deal here?

First of all, your backup software can do an incremental backup on this directory. Especially the tar files and the index will change quite often (think of the TarPM optimizer), but the datastore files won’t change, so they are ideal candidates for incremental backup.

Speaking of the datastore and its immutable files: because files are only added to the datastore (and never changed), you can optimize the backup. The online backup decides what to back up based on the directory in which the quickstart is placed. Everything in that directory and beneath is covered by the backup process. If you move the datastore to a directory outside the one where the quickstart.jar resides, the datastore is no longer part of the online backup. Why is that good? Because if we always keep our backup in the same directory, we can use rsync to back up the datastore. Rsync is much better suited for this job, because it works incrementally (the CRX online backup doesn’t) and runs outside of the CRX/CQ5 Java process (which is always good). Remember: files in the datastore are only added, not changed!

I recommend changing your CQ5 filesystem layout as follows:

cq5
 - backup
   - datastore
   - cq5
     - cq-wcm-quickstart.jar
     - crx-quickstart
     - ...
     - license.properties
 - datastore
 - cq5
   - cq-wcm-quickstart.jar
   - crx-quickstart
   - ...
   - license.properties

This layout ensures that you would be able to start CQ5 directly from the backup directory (although you shouldn’t do it) or simply move it to the “production” location without any configuration changes.

Moving the datastore is easy: Shut down CQ5/CRX and replace the line
<DataStore class="com.day.crx.core.data.ClusterDataStore"/>

in the repository.xml with the following ones:

<DataStore class="com.day.crx.core.data.ClusterDataStore">
  <param name="path" value="../../../../datastore" />
</DataStore>

I recommend using a relative path here; because it is relative to the directory crx-quickstart/repository/repository, you should enter “../../../../datastore” here. Adding a symlink won’t help!

Then create this directory structure, move all the datastore files there (crx-quickstart/repository/repository/datastore/*) and make sure that the ACLs are set properly. Start up and have fun!
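The move itself could be scripted roughly like this. It is a sketch under the layout shown above (the datastore ends up next to the directory that holds the quickstart jar); the function name move_datastore is made up, and the repository.xml change still has to be done by hand.

```shell
#!/bin/sh
# Move the datastore out of the quickstart directory. CQ5/CRX must be stopped.
# $1 = directory containing cq-wcm-quickstart.jar, e.g. /opt/cq5/cq5
move_datastore() {
    instance=$1
    old="$instance/crx-quickstart/repository/repository/datastore"
    # The location "../../../../datastore" resolves to, seen from
    # crx-quickstart/repository/repository.
    new="$instance/../datastore"

    [ -d "$old" ] || { echo "no datastore found at $old" >&2; return 1; }
    [ ! -e "$new" ] || { echo "$new already exists" >&2; return 1; }

    # Moving the whole directory keeps ownership and permissions as long
    # as source and target are on the same filesystem.
    mv "$old" "$new"
}

# Usage: move_datastore /opt/cq5/cq5
```

Refusing to run when the target already exists is a deliberate choice here: it avoids mixing files from two datastores by accident.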

And now the rough skeleton of a backup script to illustrate the process; it requires curl, rsync and a little perl. For Windows-based systems I cannot give much advice; there is probably some tool similar to rsync.


#!/bin/sh

HOST=localhost:4502
ADMINPW=admin # must be URL encoded!!

BACKUP_FILENAME="" # we want CRX to backup to the directory, no zip file!
BACKUP_DIR=/opt/cq5/backup
INSTANCE_DIR=/opt/cq5

COOKIE=cookie.txt
CURLPARAMETERS="-s -S"

# hacky, using an appropriate perl module would be better
ENCODED_BACKUP_DIR=`echo "${BACKUP_DIR}" | perl -pe 's|\/|%2F|g'`
touch ${COOKIE}
chmod 600 ${COOKIE}
echo "Start: `date`"
curl -c ${COOKIE} ${CURLPARAMETERS} "http://${HOST}/crx/login.jsp?UserId=admin&Password=$ADMINPW&Workspace=crx.default" > /dev/null
curl -b ${COOKIE} ${CURLPARAMETERS} -o progress.txt "http://${HOST}/crx/config/backup.jsp?action=add&targetDir=${ENCODED_BACKUP_DIR}&zipFileName=${BACKUP_FILENAME}" > /dev/null
rm progress.txt
rm ${COOKIE}

echo "Syncing datastore"
DATASTORE_DIR="${INSTANCE_DIR}/datastore/"
DATASTORE_BACKUP_DIR="${BACKUP_DIR}/datastore"
rsync -a ${DATASTORE_DIR} ${DATASTORE_BACKUP_DIR}
echo "Finished: `date`"

Update: Fixed path in the repository.xml snippet. Thanks Thomas.
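The backup script notes that the admin password must be URL encoded and that the perl substitution for the backup directory is hacky. As a possible alternative, here is a sketch of a small encoder in plain POSIX shell. The function name urlencode is made up, and it is intended for ASCII input such as paths and passwords.

```shell
#!/bin/sh
# Percent-encode a string for use in a URL query parameter.
# Unreserved characters (letters, digits, ".", "_", "~", "-") pass through
# unchanged; everything else becomes %XX. The body runs in a subshell so
# that the locale change and the helper variables do not leak out.
urlencode() (
    LC_ALL=C; export LC_ALL
    s=$1
    out=""
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# Usage in the script above:
#   ENCODED_BACKUP_DIR=$(urlencode "${BACKUP_DIR}")
#   ADMINPW=$(urlencode 'my&secret')
```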

CQ 5.4: New features for sysadmins

Today the release announcement of CQ 5.4 was published. I collected some changes since CQ 5.3 which are interesting to all of you who will be involved in operating it.

  • CRX 2.2 supports shared-nothing clustering.
  • CRXDE Lite is no longer a dedicated web application, but is included in the CRX webapp.
  • The CQSE serverctl script has been cleaned up and simplified, so the description in “Bootstrapping the Java process” no longer applies. Most importantly, the “serverctl psmon” process is gone now.
  • Since CRX 2.1 it is possible to perform a backup to a directory, not only to a ZIP file. This offers some more room for optimization. Expect a follow-up article on this.
  • The CRX Quickstart requires you to have a “ulimit -n” (number of open files) of at least 8192. This will probably affect your unpack run, but after that neither CQSE nor CQ5 itself enforces it …
  • It is still possible to build custom CQ5 images.
  • CQ 5.4 offers enhanced replication: there are new options for a replication agent (on the “Trigger” tab): “no status update” and “no versioning”. Let’s try to build a shadow copy of an author instance!
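Since the open-files limit from the list above isn’t enforced later on, a start script can check it itself. This is a minimal sketch; the function name open_files_limit_ok is made up, and 8192 is the value mentioned above.

```shell
#!/bin/sh
# Return 0 if an open-files limit ($1, as printed by "ulimit -n") is at
# least the required value ($2, defaults to 8192).
open_files_limit_ok() {
    current=$1
    required=${2:-8192}
    [ "$current" = "unlimited" ] && return 0
    [ "$current" -ge "$required" ]
}

# Usage: run as the user that starts CQ5, before launching the quickstart.
#   open_files_limit_ok "$(ulimit -n)" || echo "ulimit -n is below 8192" >&2
```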

That’s it for the moment. Have fun with CQ 5.4!

Maintenance mode

I just stumbled over my old article on locking out users and felt that it is a bit outdated. The mechanism described there is only suitable for CQ3 and CQ4 and is not applicable to CQ5, because there is no “post” user and the complete access control mechanism has changed.

In CQ5 it is incredibly easy to install servlet filters (thanks to OSGi and Declarative Services), so I wrote a small servlet filter which blocks requests originating from users who are not whitelisted. That’s a nice solution which does not require any intrusive operation such as changing ACLs. You just need to deploy a tiny little bundle, put “admin” on the whitelist and enable the maintenance mode in the Felix web console. That’s it.

I will submit this package (source code plus compiled bundle) to the Day package share, licensed under the Apache 2.0 License. It may take a bit, but I will place it in the public area, so you can grab it and study the source (it’s essentially only the servlet class).

Building custom CQ5 installation images

Very often one needs to set up a number of CQ5 installations with the same feature set, e.g. if you start with a bunch of new publishing instances or you need to update your development environments with a new set of hotfixes.

One way is to provide a detailed list of instructions plus the required files to the people responsible for it. It’s important to be consistent across all affected installations and environments, so you can rule out problems caused by missing fixes or a wrong installation. But this involves a lot of manual work, which isn’t what IT people want to do.

I recently needed to provide several CQ5 installations. Because my standard installation recommendation at the moment consists of CQ 5.3 plus CRX 2.1 plus performancepack 30015 (using CQSE and the TarPM), just deploying CQ 5.3 the usual way isn’t sufficient. On the other hand, I don’t want the work of manually installing CRX 2.1 and the performancepack on top of a default CQ 5.3, both including restarts.

So I decided to build an image which contains all these components, without the need for a restart and without fiddling around with the package manager, just by using some hidden features of CQ 5.3 and CRX: the package installer and the flawless upgrade procedure of CRX 2.x (x={0,1}; it will probably also work for later versions of CRX). You can also find the documentation of the upgrade process on the official documentation site.

1.) Unpack a plain CQ 5.3:

$ cd cq530
$ java -jar cq-wcm-quickstart-author-5.3.0.jar -unpack

2.) Get CRX 2.1 and unpack it:

$ cd crx21
$ java -jar crx-2.1.0.20100426-enterprise.jar -unpack

3.) Copy the CRX webapplication file of CRX 2.1 into the unpacked CQ 5.3 installation:

$ cp crx21/crx-quickstart/server/webapps/crx-explorer_crx.war cq530/crx-quickstart/server/webapps

4.) Remove the CRXDE webapplication of CQ 5.3, as it is no longer needed for CRX 2.1:

$ cd cq530/crx-quickstart/server/webapps
$ rm crx-de_crxde.war

5.) Also edit the server.xml and remove the crxde webapp.

6.) Define the order in which the packages are deployed to CRX. Packages are deployed in the order in which a shell lists them by default, so I define the order by explicitly naming the files like “01_cq-content-5.3.jar”, “10_cq-documentation-5.3.zip” and so on. The files must be placed in the cq530/crx-quickstart/repository/install folder.

Make sure that the original “cq-content-5.3.jar” is deployed as the first package, as it contains the WCM code. Beyond that you can place any CQ package you want there: hotfixes, custom application code, initial content etc.

$ cd cq530/crx-quickstart/repository/install
$ cp cq-content-5.3.jar 01_cq-content-5.3.jar
$ cp cq-documentation-5.3.zip 10_cq-documentation-5.3.zip
$ cp .........../cq-5.3.0-featurepack-30015-1.0.zip 50_cq-5.3.0-featurepack-30015-1.0.zip

7.) If you want to use the CRXDE, you should download the file cq53-update-crxdesupport-2.1.0.zip from Day PackageShare and copy it also into the install directory:

$ cp ......../cq53-update-crxdesupport-2.1.0.zip cq530/crx-quickstart/repository/install/05_cq53-update-crxdesupport-2.1.0.zip

8.) For convenience you can now place your license.properties file in the top-level directory of your installation; the result should look something like this:

$ ls -la
-rw-r--r--   1 jorghoh  staff  233110810  6 Aug 09:38 cq-wcm-quickstart-author-5.3.0.jar
drwxr-xr-x  10 jorghoh  staff        340  7 Okt 17:55 crx-quickstart
-rw-r--r--   1 jorghoh  staff        217  6 Aug 09:39 license.properties

If you don’t want to deliver the license file with that image, you can omit it; you will then be asked for it when the instance is started for the first time.

9.) Now all parts are in place; so you can create an image file (tar file) and distribute it all over your environments:

$ cd cq530/..
$ tar -cf cq530-crx21-author-image.tar cq530/*

(if you rename the cq-wcm-quickstart-author-5.3.0.jar file to cq-wcm-quickstart-publish-5.3.0.jar, you have an image for a publish instance.)

Just unpack it and use your usual startup mechanisms (“start.bat” or “start”), and the framework will start up as usual, create a repository and also deploy all packages in the install folder directly to it. If you encounter problems, you can check in the Felix console whether all bundles are started.

Now you have an image which you can copy and unpack everywhere; even the platform (Unix, Windows) doesn’t matter, as everything depends only on the features of CRX Launchpad and CRX itself. The setup of a new empty instance, independent of the number of installed packages and hotfixes, can now be done within 2 minutes and can be fully automated.
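The ordering trick from step 6 lends itself to a small helper when the image build is automated. The following is only a sketch: the function name stage_packages is made up, and it simply numbers the packages in the order in which they are passed (which works for up to 99 packages).

```shell
#!/bin/sh
# Copy packages into an install folder with ascending two-digit prefixes,
# so that an alphabetical listing reflects the intended deployment order.
# $1 = install folder, remaining arguments = package files in that order.
stage_packages() {
    install_dir=$1
    shift
    mkdir -p "$install_dir" || return 1
    n=1
    for pkg in "$@"; do
        prefix=$(printf '%02d' "$n")
        cp "$pkg" "$install_dir/${prefix}_$(basename "$pkg")" || return 1
        n=$((n + 1))
    done
}

# Usage (cq-content-5.3.jar must come first, as described in step 6):
#   stage_packages cq530/crx-quickstart/repository/install \
#       cq-content-5.3.jar cq-documentation-5.3.zip
```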

CQ5 logging

This week I held a workshop at a customer and someone asked me “How do other customers of Day handle their logfiles? Do they check them and analyze the logfiles?” I had to admit that “according to my experience nobody really cares about them. The only situation in which they care about them is when the disk is full of them.” Yeah, a sad truth.

But this brings us to today’s topic: logfiles and keeping track of them. CQ5 is pretty noisy by default; if you check the file crx-quickstart/logs/error.log after some requests have been made, you see a lot of messages at loglevel “INFO”. Yes, they are sometimes quite interesting, but in the end they pollute the log, and the really important messages vanish in the sheer mass of noise. So, at least for production systems, the loglevel should be raised to “WARN” or even “ERROR”, so that INFO messages are suppressed.
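To get a feeling for how noisy a log really is, you can count the messages per loglevel. This sketch assumes that each log entry starts with a date, a time and then the level wrapped in asterisks (for example “*INFO*”); the function name log_level_summary is made up.

```shell
#!/bin/sh
# Count log entries per level from a logfile read on stdin.
# Assumes the third whitespace-separated field of an entry is the level
# wrapped in asterisks, e.g. "07.10.2010 17:55:01.123 *INFO* [thread] ...".
# Continuation lines such as stack traces are ignored.
log_level_summary() {
    awk '$3 ~ /^\*[A-Z]+\*$/ { count[$3]++ }
         END { for (level in count) print count[level], level }' \
        | sort -rn
}

# Usage:
#   log_level_summary < crx-quickstart/logs/error.log
```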

So, how can this be achieved? Sling, as part of the WCM part of CQ5, brings its own logging; it can be configured using the Felix console and is well documented on the Day documentation site. CRX (at least up to CRX 2.1) has its own logging mechanism (log4j), which can be reconfigured in the crx-quickstart/server/runtime/0/_crx/WEB-INF/log4j.xml file.

And, on top of this all, we have on a standard Unix system

  • crx-quickstart/logs/stderr.log and crx-quickstart/logs/stdout.log
  • crx-quickstart/logs/server.log
  • crx-quickstart/server/logs/startup.log

Neat, isn’t it? OK, how can you configure them?

Short answer: you can’t. At least it isn’t documented.

The stdout.log and stderr.log files are the standard output and standard error channels of the Java process, which are redirected to these files. Especially stdout.log fills up pretty fast, because CRX also logs all its messages to stdout. So fixing the log4j.xml file is mandatory, because we don’t need this information twice, in the crx/error.log and in the stdout.log file. Oh, and of course these files aren’t rotated; new data is only appended. So they grow and grow and grow.

The server.log file is written by the CQSE servlet engine and cleared when the servlet engine is started. The same applies to startup.log, which contains the output of the serverctl script before it starts the Java process, and also error messages if the Java process doesn’t start at all (most often due to invalid parameters).

A few recommendations (just a personal point of view):

  • Log rotation should be time-based and not based on the size of the logfile. You need enough disk space for that and should monitor it closely, of course. But this helps you to look up a certain problem (“Wait, it was yesterday, so it must be in the error.log.0 file”) without hassle.
  • Implement your own logfile rotation for the stdout.log and stderr.log files. I filed a bug for it too, but till then you need to help yourself. Sorry.
  • Increase the loglevel to WARN. INFO just logs too much noise.
  • Adjust the log4j.xml of CRX and change it to something like this:
<root>
  <level value="warn" />
  <appender-ref ref="error"/>
</root>
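For the home-grown rotation of stdout.log and stderr.log recommended above, a copy-and-truncate approach is one option, because the running Java process keeps these files open. This is a sketch, not a tested production script: the function name rotate_copytruncate is made up, and it assumes the files are opened in append mode (otherwise the process would continue writing at its old offset after the truncate).

```shell
#!/bin/sh
# Rotate a log that a running process keeps open: copy it aside, then
# truncate the original in place so the writer's file handle stays valid.
# Lines written between the copy and the truncate are lost.
# $1 = logfile, $2 = suffix for the copy (defaults to today's date)
rotate_copytruncate() {
    log=$1
    suffix=${2:-$(date +%Y-%m-%d)}
    [ -s "$log" ] || return 0            # missing or empty: nothing to do
    cp -p "$log" "$log.$suffix" || return 1
    : > "$log"
}

# Usage, e.g. once per day from cron for time-based rotation:
#   rotate_copytruncate /opt/cq5/cq5/crx-quickstart/logs/stdout.log
#   rotate_copytruncate /opt/cq5/cq5/crx-quickstart/logs/stderr.log
```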

Adjusting the logging to your needs shows that you care about your logfiles and know that they are useful. That is a required step before doing any analysis on them. But that topic is a candidate for one of the next postings.

CQ 5.3: Features for sysadmins

Last week CQ 5.3 was released.
I will give you a short review of the changes which have an impact at the infrastructure and administration level.

Package Share: Day introduced the concept of a package share, from which packages can be downloaded. It is designed for 3 different use cases:

  1. CQ product hotfix distribution
  2. You can create your own private part and place packages there (distribution within your projects, sharing with Daycare support, …)
  3. Make packages publicly available (example code and components)

I don’t recommend using package share in your production system, for the following 2 reasons:

  • It requires an outgoing connection to a Day server (I haven’t seen any dedicated proxy support for it, so if you need to go through a proxy you have to add -Dhttp.proxyHost=$HOST -Dhttp.proxyPort=$PORT … to your JVM parameters); your firewalling/DMZ concept may also not allow outgoing connections.
  • It allows you to easily bypass the concept of staged code deployments and testing, that is, to deploy hotfixes directly into production, which is not recommended.

So just use it on your development machines to fetch the hotfixes and share packages. No need to request them via a Daycare ticket anymore.

CQ 5.3 requires Java 1.5 for all parts. With CQ 5.2 it was still possible to operate CRX with Java 1.4; with the upgrade to CRX 2.0 this is no longer possible.
(If you start from scratch, just take Java 1.6; the Java 1.5 by Sun/Oracle reached its end-of-life …)

By default the serverctl configures CQ_MAX_OPEN_FILES to 8192 (yeah!); this is a reasonable value and should be sufficient for most of our customers.

Config section in Apache Felix Console (CQ 5.3)

A new version of the Felix console allows you to grab the complete configuration status with just a single click (creating a zip file); just check http://$HOST/system/console/config. This is a great feature from a support perspective. Gilles, you will love it :-).

A short list of topics I want to address for the next release, to improve the CQ experience for the system administrators who work with our great product:

  • remove all code related to Java versions prior to 1.5 from the startup scripts; there is some cruft in there, which should be removed.
  • apply a reasonable set of additional JVM parameters offered by Java 1.5; having them in place out-of-the-box could remove some JVM-related problems and provide better information for problem resolution.