CRX 2.1: Improved backup

CRX 2.1 introduced no major changes, only smaller improvements in many places. In simple geometrixx-based performance tests you can get a speedup of about 30% (at least that’s what I measured with CRX 2.1 and CRX Hotfix 2.1.0.5), just because of improvements in the Jackrabbit and CRX code. So I would recommend moving to CRX 2.1 for every CQ 5.3 project.

Next to these improvements a small extension to the online backup was introduced. Up to and including CRX 2.0 you backed up to a zip file (which contains the complete quickstart and the repository). In CRX 2.1 it is no longer required to zip the backed-up data; the backup procedure can simply drop it into a directory. This directory still contains all data (quickstart and the repository), so you can run the zip program yourself if you want. So, what’s the big deal here?

First of all, your backup software can do an incremental backup on this directory. Especially the tar files and the index will change quite often (think of the TarPM optimizer), but the datastore files won’t change, so they are ideal candidates for incremental backup.

Speaking of the datastore and its immutable files: Because files are only added to the datastore (and not changed), you can optimize the backup. The online backup covers the directory in which the quickstart is placed; everything in that directory and beneath is part of the backup. If you move the datastore to a directory outside the one where the quickstart.jar resides, the datastore isn’t backed up by CRX. Why is that good? Because if we always place our backup in the same directory, we can use rsync to back up the datastore. Rsync is much better suited for this job, because it works incrementally (the CRX online backup doesn’t) and runs outside of the CRX/CQ5 Java process (which is always good). Remember: Files in the datastore are only added, not changed!

I recommend changing your CQ5 filesystem layout as follows:

cq5
 - backup
   - datastore
   - cq5
     - cq-wcm-quickstart.jar
     - crx-quickstart
     - ...
     - license.properties
 - datastore
 - cq5
   - cq-wcm-quickstart.jar
   - crx-quickstart
   - ...
   - license.properties

This layout ensures that you would be able to start CQ5 directly from the backup directory (although you shouldn’t do it) or simply move it to the “production” location, without any configuration changes.

Moving the datastore is easy: Shut down CQ5/CRX and replace the line
<DataStore class="com.day.crx.core.data.ClusterDataStore"/>

in the repository.xml with the following ones:

<DataStore class="com.day.crx.core.data.ClusterDataStore">
<param name="path" value="../../../../datastore" />
</DataStore>

I recommend using a relative address here; because it’s relative to the directory crx-quickstart/repository/repository, you should enter “../../../../datastore” here. Adding a symlink won’t help!

Then create this directory structure, move all the datastore files (crx-quickstart/repository/repository/datastore/*) there and make sure that the ACLs are set properly. Start up and have fun!
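
The move itself could look roughly like the following sketch. It assumes the layout suggested above (the quickstart in <root>/cq5, the new datastore in <root>/datastore) and that CQ5/CRX has already been shut down; the function name move_datastore is made up for this example.

```shell
#!/bin/sh
# Sketch: move the datastore out of the instance directory.
# Assumptions: layout as suggested above (<root>/cq5 holds the quickstart,
# <root>/datastore becomes the new datastore) and CQ5/CRX is stopped.

move_datastore() {
    root="$1"
    old="${root}/cq5/crx-quickstart/repository/repository/datastore"
    new="${root}/datastore"

    if [ ! -d "${old}" ]; then
        echo "no datastore found at ${old}" >&2
        return 1
    fi
    if [ -e "${new}" ]; then
        echo "${new} already exists, refusing to overwrite it" >&2
        return 1
    fi

    # moving the whole directory keeps the internal structure of the datastore
    mv "${old}" "${new}"
}

# Example call for an installation below /opt/cq5:
# move_datastore /opt/cq5
```

Afterwards check the owner and the permissions of the new directory before you start the instance again.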

And now the rough skeleton of a backup script to illustrate the process; it requires curl, rsync and a little Perl. For Windows-based systems I cannot give much advice; there is probably some tool similar to rsync.


#!/bin/sh

HOST=localhost:4502
ADMINPW=admin # must be URL encoded!!

BACKUP_FILENAME="" # we want CRX to backup to the directory, no zip file!
BACKUP_DIR=/opt/cq5/backup
INSTANCE_DIR=/opt/cq5

COOKIE=cookie.txt
CURLPARAMETERS="-s -S"

# hacky, using an appropriate perl module would be better
ENCODED_BACKUP_DIR=`echo "${BACKUP_DIR}" | perl -pe 's|\/|%2F|g'`
touch ${COOKIE}
chmod 600 ${COOKIE}
echo "Start: `date`"
curl -c ${COOKIE} ${CURLPARAMETERS} "http://${HOST}/crx/login.jsp?UserId=admin&Password=$ADMINPW&Workspace=crx.default" > /dev/null
curl -b ${COOKIE} ${CURLPARAMETERS} -o progress.txt "http://${HOST}/crx/config/backup.jsp?action=add&targetDir=${ENCODED_BACKUP_DIR}&zipFileName=${BACKUP_FILENAME}" > /dev/null
rm progress.txt
rm ${COOKIE}

echo "Syncing datastore"
DATASTORE_DIR="${INSTANCE_DIR}/datastore/"
DATASTORE_BACKUP_DIR="${BACKUP_DIR}/datastore"
rsync -a ${DATASTORE_DIR} ${DATASTORE_BACKUP_DIR}
echo "Finished: `date`"

Update: Fixed path in the repository.xml snippet. Thanks Thomas.

CQ 5.4: New features for sysadmins

Today the release announcement of CQ 5.4 has been published. I collected some changes since CQ 5.3 which are interesting to all of you who will be involved in operating it.

  • CRX 2.2 supports shared-nothing clustering.
  • CRXDE Lite is no longer a dedicated web application, but is included in the CRX webapp.
  • The CQSE serverctl script has been cleaned up and simplified, so the description in “Bootstrapping the Java process” no longer applies. Most notably, the “serverctl psmon” process is gone now.
  • Since CRX 2.1 it is possible to perform a backup to a directory, not only to a ZIP file. This offers some more room for optimization. Expect a follow-up article on this.
  • The CRX Quickstart requires you to have a “ulimit -n” (number of open files) of at least 8192. This will probably affect your unpack run, but neither CQSE nor CQ5 itself enforces it afterwards …
  • It is still possible to build custom CQ5 images.
  • CQ 5.4 offers enhanced replication: There are new options for a replication agent (on the “Trigger” tab): “no status update” and “no versioning”. Let’s try to build a shadow copy of an author instance!
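
The open files requirement from the list above can be checked before the first start. This is only a sketch: the value 8192 is taken from the list, while the function name has_enough_open_files is made up for this example.

```shell
#!/bin/sh
# Sketch: check whether the current shell allows enough open files.

has_enough_open_files() {
    limit="$1"          # output of "ulimit -n"
    min="${2:-8192}"    # required minimum, 8192 by default

    # "unlimited" is always sufficient
    if [ "${limit}" = "unlimited" ]; then
        return 0
    fi
    [ "${limit}" -ge "${min}" ]
}

current=`ulimit -n`
if has_enough_open_files "${current}"; then
    echo "ulimit -n is ${current}: ok"
else
    echo "ulimit -n is ${current}: raise it to at least 8192 before starting the quickstart"
fi
```

Run it as the user which will start the instance, because the limit is set per user and per shell.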

That’s it for the moment. Have fun with CQ 5.4!

Reporting application problems

(I write this blog article with a certain background: In the Daycare ticketing system the support often needs to ask for additional information before it can start with an initial analysis of an issue report. This is a time-consuming task and increases the time-to-fix. So this is meant as a help for my colleagues at the Day support, but it should also be applicable to the contact with most enterprise support lines.)

Writing a bug report is hard work. Many issue reporters think that the people who are responsible for the application just try to avoid fixing a bug, and therefore ask questions and demand information which is hard to deliver and which seems absolutely obvious to you; that these people don’t want to admit that their product has issues.

But these questions can often be easily answered, if you are well prepared.

Usually developers (and the support people, who work as first line for them) ask the following questions:

What software are you using, which versions, which additional fixes?
People often assume that this information is already known to the support (especially if you’re dealing with an enterprise support); but these support lines often don’t track the software versions of their customers. And who knows, maybe you are reporting an issue with a new version which is currently only installed on your development systems.

Providing this information by default when opening an issue report helps the support to help you quicker. There’s no need to deal with version information any further or to ask for installed hotfix versions. That is at least one round of questions and answers less.

A point for all software developers: Provide a facility to get all this information without having to dig through the package databases or the registry of your systems. Keep this information up-to-date automatically when additional packages, fixes or enhancements are installed.

What’s the impact of the reported issue?
Describe the impact of the problem, so the support can estimate the importance of the issue. A report on a wrongly documented feature gets a different priority on the developers’ todo list than an ERP system which crashes every hour.
Also name the audience which is affected by the issue. A non-working feature which is offered as a vital part of your website is clearly more important than the same feature being non-functional for a small group of people, because for the latter it is probably easier to provide a workaround.
(It will probably be super-important if this small group is the management, but that’s another topic …)

When was the issue spotted first?
This information may help to correlate the issue with other events; often problems become visible only under certain circumstances which were not present at the start. This may be a system update (operating system, JVM, database, …), changed settings in the application itself or just a heavier use of the system (more data, more users, higher peak usage). All these factors may increase the probability that certain, yet unknown and unspotted problems become visible and harm your application.
(That’s the background of the famous quote “never change a running system”.)

It’s your task as an issue reporter to provide this information to the support. It helps the support to focus on the impact of such changes, which very often reduces the amount of investigation dramatically.

So for example, if you have recently updated your Sun JVM from 1.5.0.8 to 1.5.0.11 for security reasons and suddenly encounter spurious crashes, the developers may focus on the changes introduced by this JVM update and analyse their impact on the application. Without this information you probably have to go through a long and painful analysis phase, in which developers ask for all kinds of dumps, JVM instrumentation and so on.

Is the problem reproducible?
This is the question a developer is most interested in. If an issue can be reproduced, it can be fixed, because the developer can analyze the issue, understand the problem and then solve it, all without too much trial and error just to see the problem.
If an issue cannot be reproduced, a lot of information is not known. Maybe the problem occurs only under conditions which exist on your particular system or with your particular data. Trying to reproduce the issue on any other system is then hard or impossible.
So this is one of the most important tasks of an issue reporter: trying to provide a reproducible test case. If you are able to reproduce the issue, describe all the prerequisites and the steps to actually reproduce it. Be it a step-by-step description or a little screencast, any appropriate format is welcome.

In the case of Day CQ, the playground/geometrixx application of a plain CQ installation can be used as the basis for test cases. So just install a plain CQ and make as few changes as possible to reproduce the problem.

If you can reproduce your problem on a plain CQ installation, you make the task of fixing your issue much easier for Day. Time-consuming analysis and assumptions about a lot of parameters can be avoided then, and the developers can head directly for the issue itself.

Often you cannot reproduce the issue you want to report; either you don’t know the issue exactly (“my system just crashes”), or you cannot reproduce the problem because it’s specific to a certain environment (“the crash only happens under heavy load; we couldn’t reproduce this crash using stress tests yet”). Then you need to provide as much information as possible.

Additional information
Attach all available information (ok, not really _all_ information; only what looks useful, e.g. logfiles containing application-specific logs, system dumps, thread dumps for Java applications, …) to your issue report.

If some special information is missing, the support will ask for it. But if you provide a certain standard set of information (depending on your application), this will be sufficient in 90% of the cases.

For a Day CQ installation this information is the following:

  • error.log of CQ
  • error.log of CRX
  • in case of performance problems: request.log, garbage collection log
  • in case of performance problems and system lockups: thread dumps
  • in case of performance problems and out-of-memory exceptions: thread dumps, heap dumps
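
Collecting the log files from the list above can be scripted, so they are at hand when you open the issue report. The following is only a sketch under some assumptions: all logs of one component live in a single directory which you pass in, the garbage collection log is called gc.log (a hypothetical name, it depends on your JVM options), and the function name collect_logs is made up for this example.

```shell
#!/bin/sh
# Sketch: bundle the standard log files of one log directory.
# Assumption: error.log, request.log and gc.log (hypothetical name)
# are located directly in the given directory; missing files are skipped.

collect_logs() {
    logdir="$1"     # directory containing the log files
    archive="$2"    # tar.gz file to create
    files=""

    for name in error.log request.log gc.log; do
        if [ -f "${logdir}/${name}" ]; then
            files="${files} ${name}"
        fi
    done

    if [ -z "${files}" ]; then
        echo "no log files found in ${logdir}" >&2
        return 1
    fi

    # ${files} is intentionally unquoted so that it splits into single names
    tar -czf "${archive}" -C "${logdir}" ${files}
}

# Example call (both paths are placeholders):
# collect_logs /path/to/logs /tmp/issue-report-logs.tar.gz
```

Call it once per log directory, because the error.log of CQ and the error.log of CRX are different files.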

Conclusion

For all these questions there are good reasons why they are asked. I hope I showed you some of the background to understand these reasons.

So providing the right information right from the start will reduce the time until you get support which actually helps you in resolving your issue, or which can at least provide useful tips that may help to establish workarounds. So in the long term it helps both you as an issue reporter and the support.

A good issue report

(inspired by How To Ask Questions The Smart Way by Eric S. Raymond)

In the last months I encountered several situations in which people brought up issues like “the site isn’t working” or “the site is slow”, without any further information. If I am responsible for the thing in question, this leaves me with a hard task: I am expected to fix a problem for which I don’t have any information. Sometimes the problem doesn’t even exist on my side, because the situation perceived as a problem is caused by the system of the guy reporting it or by something in between. But we don’t know.

People who get such reports tend to have different strategies: One strategy is just to reject such issues because “I cannot reproduce it on my system; the website is responding quite quickly and behaves fine” and throw them away (move the complaint email to the trash or close the ticket in the issue system). Others just ignore them (for the same reason), but don’t throw the report away. A third approach is to request more information on the issue. That would be a good approach, but either the reporters do not react anymore (because they are just busy), the problem is gone in the meantime (for whatever reason), or they cannot provide the requested information anymore. Also not a satisfying solution.

So the only remaining solution is to force people to provide the required information with the initial issue report. If you ask people to describe their problem closely and in detail, they like to provide this information, because they feel that somebody really wants to solve their problem. But because they are not aware that (and which) information is needed for the resolution of an issue, they tend to provide no information at all.

So the goal is to have a list of things which have to be provided by the reporter along with the issue. In the end an issue report would look like this:

“At about 1:40pm on September 23rd 2009 I requested the page http://www.abc.foo/a/b.html; I received only a damaged page, with some pictures missing. When I tried to login using my credentials (my username is hbt85), I received an internal server error page; I used the Firefox browser installed on my corporate computer.”

Although this report is very brief (and probably doesn’t take much time to write), it contains valuable information, so a system administrator can immediately start to look for the problem and has a realistic chance of finding traces of the described issue (in the logfiles, in system dumps or in the application monitoring), because it contains the following important information:

  • Who is the reporter? (Not only the email address or the full name of the user, but also the username in the affected system)
  • What system is affected? (given by the exact hostname)
  • When does the issue occur? (time and date)
  • What has the reporter done and what were the effects of it?

So the most time-consuming task of every support is to qualify every incoming request to such a level that a qualified guy can take a look at the system to identify and fix the issue. While qualifying such problems it very often turns out that the cause of the problem lies on the user side, be it missing training on the system resulting in wrong usage, missing or incorrect documentation or just errors on the user side. In our example it could be that the user just used a wrong URL; he should have used “http://www.abc.foo/a-new/b.html”, but he missed the mail announcing this change.

So a major job of every support organisation is to maintain a prepared, up-to-date list of the information which is needed to resolve an issue. So if you are in a support organisation, provide a list of the pieces of information which must be provided by an issue reporter. And if you’re an experienced user and want to report an issue and really want it fixed, provide as much information as possible. There is no “too much information” regarding a specific problem.

Meta: new Job, Ignite 2009

Although times seem to be hard with the financial crisis, which hit many companies and decreased IT budgets, I decided to leave my old job. Since October 1st I am employed as a Solution Architect at Day Software and will be working in Frankfurt am Main (Germany). At the moment I am in Basel at the Barfüsserplatz office, getting to know the people (hey, you really rock!) and the products. So I will also be able to cover the latest versions in this blog.

I will also visit the customer summit in Zurich this year (Thursday only), meeting people and learning about our products 🙂 So see you there!

Traversing the content hierarchy

When you play around with websites, you often get a good feeling for how much “engineering” work and thought was put into the site. Think of things like SQL injection and shell escape injection, which were a problem a few years ago. If you encounter a site today which is vulnerable to such a problem, it’s either a problem of the budget (which is a lame excuse, because modern frameworks avoid such problems) or of the skill of the developer. An up-to-date site isn’t affected by such problems anymore.

A problem which is clearly visible, but often not known to application developers and architects, is the content structure which is exposed to the user by the URL. Consider the following small example of a simple content structure:

/content/brand/en/home

for the startpage of a CQ-based website. The “home”-handle is the startpage and is thus called via

www.example.com/content/brand/en/home.html

So, where’s the problem? Well, most templates provide a kind of HTML representation of their content. So let’s try

www.example.com/content/brand/en.html

Maybe a structure handle such as the language node (which is often just used to differentiate between languages) also provides an HTML representation of its content; it could just render its child nodes as a bulleted list.

So, what then? Does it do any harm if you reveal that you provide Chinese content besides English? No, most times it doesn’t. But what if you already have fresh content ready which is not yet linked? It would appear in such a list. Or if you have “hidden content”, functions which are known only to a small group of people? Things which aren’t secured by authentication and authorization. Suddenly someone has found your private data and could make use of it.

The trash can for functionality is often a folder named “tools”; developers tend to place everything there which doesn’t fit well into any other category. So you can find contact forms, search functionality and other stuff there. So what happens if you call

www.example.com/content/brand/en/home/tools.html

Does it also show your unused/crappy/new functions, which aren’t used in the website but are still there? Maybe because some developer thought it would be convenient to have all tools listed without major hassle (1 bookmark instead of 10). Bad idea, you just showed all your available tools to someone who shouldn’t see them.

So check on your site that structure nodes, which are only used to structure your content, either cannot be rendered at all or don’t reveal any information which could be useful for an attacker. Either return an empty page or (suggested) return the HTTP status code “403” (Forbidden). Don’t reveal data when it isn’t necessary. A well-engineered site also takes care of such “attacks” and doesn’t reveal any data which could be of use to a potential attacker.
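
Such a check can be scripted with curl. The following is only a sketch: the function names classify_status and probe are made up for this example, the URLs are the example paths used in this post, and the classification of the status codes is my own rough rule of thumb.

```shell
#!/bin/sh
# Sketch: request structure nodes and report the HTTP status code.

classify_status() {
    case "$1" in
        403|404) echo "ok, not rendered" ;;
        200)     echo "rendered, check that the page reveals nothing" ;;
        *)       echo "check manually" ;;
    esac
}

probe() {
    url="$1"
    # -o /dev/null drops the body, -w prints only the status code
    status=`curl -s -o /dev/null -w '%{http_code}' "${url}"`
    echo "${url}: ${status} (`classify_status "${status}"`)"
}

# Example calls with the paths from this post:
# probe "http://www.example.com/content/brand/en.html"
# probe "http://www.example.com/content/brand/en/home/tools.html"
```

A status of 200 is not necessarily a problem (an empty page is fine), but then you have to look at the page yourself.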

I’ve already done such tests on several CQ-based websites and found (besides some other things) a monitoring page (containing version information of the libraries used) and also a hidden web special which was dedicated to a member of the web team who was heading for another location (hi, Katrin!). All of this information was publicly viewable (on a major corporate website!) just by playing around with path names and then following links.

CQ Dispatcher 4.0.3 available.

A few days ago Day released a new version of the Day CQ Dispatcher plugin. One of the most important topics in this release is the number of supported platforms. The dispatcher now ships as an Apache module for Solaris SPARC and x86, both for Apache 2.0 and Apache 2.2. Finally the most relevant platforms are supported for Apache 2.0 and 2.2 (the exception to this rule is AIX). There are also a few bugfixes for the Windows and Mac OS X platforms and, last but not least, again a fix for the permission sensitive caching.

Permission sensitive caching will be the next topic here, so stay tuned.