CQ5 logging

This week I held a workshop at a customer and someone asked me “How do other customers of Day handle their logfiles? Do they check them and analyze the logfiles?” I had to admit that “according to my experience nobody really cares about them. The only situation they care about them is when the disk is full f them.” Yeah, a sad truth.

But this brings us to todays topic: Logfiles and keeping track of them. CQ5 is by default pretty noisy; if you check the file crx-quickstart/logs/error.log after some requests have been made, you see a lot of messages of loglevel “INFO”. Yes, sometimes quite interesting, but in the end they pollute the log and the real important messages vanish in the pure mass of these noise. So, at least for production systems, the loglevel should incrased to WARN or even “ERROR”, so only logs at level WARN or ERROR are logged, INFO is supressed.

So, how can this be achieved? Sling as part of the WCM part of CQ5 brings its own logging, it can be configured using the Felix console and is well documented on the Day documentation site. CRX (at least up to CRX 2.1) does have its own logging mechanisms (log4j), which can be reconfigured in the crx-quickstart/server/runtime/0/_crx/WEB-INF/log4j.xml file.

And, on top of this all, we have on a standard Unix system

  • crx-quickstart/logs/stderr.log and crx-quickstart/logs/stdout.log
  • crx-quickstart/logs/server.log
  • crx-quickstart/server/logs/startup.log

neat, isn’t it? Ok, how can you configure them?

Short answer: you can’t. At least it isn’t documented.

The stdout.log and stderr.log and the standard output and standard error channel of the java process, which is redirected to these files. Especially stdout.log fills up pretty fast, because CRX logs all its messages also to the stdout. So fixing up the log4j.xml file is mandatory, because we don’t need this information twice in the crx/error.log and the stdout.log file. Oh, and of course these files aren’t rotated, but new data is appended only. So it grows and grows and grows.

The server.log file is written by the CQSE servlet engine and cleared when the servlet engine is started. Same as for the startup.log, which contains the output of the serverctl script before starting the java and also error messages, if the java process doesn’t start at all (most times due to invalid parameters).

A few recommendations (just a personal point of view):

  • Log rotation should be performed on a timely basis and not be based on the size of the logfile. You should have enough space then and monitor it closely, of course. But this helps you to lookup a certain problem (“Wait, it was yesterday, so it must be in error.log.0 file”) without hassles.
  • implement your own logfile rotation for the stdout.log and stderr.log files. I fill a bug for it too, but till then you need to help yourself. Sorry.
  • Increase loglevel to WARN. INFO just logs too much noise.
  • Adjust the log4j.xml of CRX and change it to something like this:
<root>
 <level value="warn" />
 <appender-ref ref="error"/>
 </root>

So adjusting the logging according to your needs shows, that you care about them and know, that they are useful at all. Which is a required step to do some analysis on them. But that topic is a candidate for one of the next postings.

Bootstrapping the CQ java process

In this article I will cover one special “feature” of the unix variant of CQ, which is the start script “serverctl”; its job is to perform the start and stop of the CQ java process. For Windows there is the server.bat version, which is kind of straight forward. The serverctl script is more complex and historically grown. I try to explain the basics of the CQ process handling, the handling of the java arguments and will show some ways to cope with the open-files problem. If you use CQ without the prepackaged CQSE, you don’t use this script, therefor this posting may not be interesting to you.

The first half of the script has actually no influence on the whole process handling, it is just preseeding parameters and parses the command line arguments. At about line 350 the interesting things start. In the default the CQ process is starting in the background; this start is triggered by the psmon process (actually the script starts a new incarnation of itself with the additional parameter “psmon”; in the process table it is visible with this “psmon” parameter, so I will just call it “psmon-process”).

This process performs the following important things:

  • Register a trap: when the TERM-signal is received it will remove the PID-files of its own process and the PID file of the Java process.
  • Start the terminator process (again a new instance of the process with the “terminator” parameter”) and attach its stdout filedescriptor to the stdin of the bgstart-process (same procedure here). This actually happens in a endless loop.

The terminator process is actually the another instance of the script. It also registers a trap (reacting on some signals). The main action in this trap-handler is that the string “QUIT” is written to stdout.

The bgstart process is another serverctl-instance, which creates the cq.pid file with its own PID and then replaces itself completly by the final java process. Because the stdout filehandle of the terminator process is connected to the stdin-process of the bgstart process, it is inherited to the java process; so the terminator-process writes its “QUIT” to the stdin of the Java process!

Ok, the startup is quite clear now, so why the hell do we actually need 3 processes, 2 shell scripts and the java process? Well, let’s consider how the whole thing is shut down properly (and killing the Java process is not a good option).

The whole thing starts with the “stop” option (about line 570). First the TERM-signal is sent to the psmon-process, which then removes its own PID file and the cq.pid file of the java process. Then the TERM-signal is sent to the terminator process, which writes the “QUIT” string to the java process. This string obviously tells the servlet engine to start a shutdown. Then the stop process waits up to 20 second and checks if the cq.pid file vanishes. If after these 20 seconds the process is still there, it sends a TERM-signal to the java process (and assumes that this will do the rest and bring it down). But killing the java process isn’t really the friendly way, because the process will terminate immediately, leaving the shutdown process unfinished and the whole CQ/CRX in a inconsistent way. During next startup CRX will usually complain about a unclear shutdown; in most times you need to remove .lock files first before it really starts up properly.

Ok, finally the individual jobs of each process:

  • psmon: Restart in case of crashes
  • terminator: When issued, write “QUIT” to the java process
  • java process: actually do the work.

And for now a few tips:

Especially for large instances 20 seconds are often not enough to properly shutdown the whole CQ system. You want to increase this time when your system does not properly startup because either CRX performs some repair actions on startup; or it just refuses to startup because some .lock files are present. Then change the number in line 591 to something like this (I replaced the value 20 seconds by 300):


COUNTER=0
while [ -f "$CQ_LOGDIR/cq.pid" ] && [ $COUNTER -lt 300 ]; do
  printf "."
  COUNTER=`expr $COUNTER + 1`
  sleep 1
done

This causes the script to wait up to 5 minutes, which should be enough for every CQ to shutdown. But if there are other problems, you have to wait 5 minutes until finally the kill happens.

When the java process was actually killed, it often leaves back some files in an inconsistent state. To stabilize the restart behaviour you may decide that you don’t want CQ to complain and stop during the regular startup; you just want CQ back in action asap and immediately start the recovery. You can add then the following lines to line 476 (before the “info” statement):


# remove all .lock files of a CQ crash/process kill
find $CQ_CONTEXT -name ".lock" | while read LOCKNAME; do
  warn "remove stale lock file $LOCKNAME"
  rm $LOCKNAME
done

Do this on your own risk, because you will never get to know if lockfiles have been left until you see CQ rebuilding its search and/or index files. But, a simple restart will your system bring back up (besides the case, when a crash killed things which cannot be recovered). With this proposed solution you have some data in the startup.log file, if (and which) stale lock files have been removed.

Another problem which often arises, is the number of open files. CQ5 and CRX as repository usually have many files open, a default installation has about 300 open files immediately after startup. If the number of requests increases and your repository grows, this number will grow too. At some point you will need to increase the maximum number of open files (in Unix speak: the value of ulimit -n).
By default this value is set to 1024 (CQ 5.2 and 5.2.1) by the serverctl script (until it is overriden by the value of CQ_MAX_OPEN_FILES in the start.sh wrapper script). Increasing this value by adjusting /etc/security/limits.conf (Redhat/Fedora) or via any other OS-preferred way does not help, the serverctl script always overrides this value. Applying this patch will fix this behaviour (to my readers from within Day: already reported in Bugzilla). A small patch for the “serverctl status” command will also print the configured value plus the current number of open files for the CQ process (also in Bugzilla).