Bootstrapping the CQ java process

In this article I will cover one special “feature” of the unix variant of CQ, which is the start script “serverctl”; its job is to perform the start and stop of the CQ java process. For Windows there is the server.bat version, which is kind of straight forward. The serverctl script is more complex and historically grown. I try to explain the basics of the CQ process handling, the handling of the java arguments and will show some ways to cope with the open-files problem. If you use CQ without the prepackaged CQSE, you don’t use this script, therefor this posting may not be interesting to you.

The first half of the script has actually no influence on the whole process handling, it is just preseeding parameters and parses the command line arguments. At about line 350 the interesting things start. In the default the CQ process is starting in the background; this start is triggered by the psmon process (actually the script starts a new incarnation of itself with the additional parameter “psmon”; in the process table it is visible with this “psmon” parameter, so I will just call it “psmon-process”).

This process performs the following important things:

  • Register a trap: when the TERM-signal is received it will remove the PID-files of its own process and the PID file of the Java process.
  • Start the terminator process (again a new instance of the process with the “terminator” parameter”) and attach its stdout filedescriptor to the stdin of the bgstart-process (same procedure here). This actually happens in a endless loop.

The terminator process is actually the another instance of the script. It also registers a trap (reacting on some signals). The main action in this trap-handler is that the string “QUIT” is written to stdout.

The bgstart process is another serverctl-instance, which creates the file with its own PID and then replaces itself completly by the final java process. Because the stdout filehandle of the terminator process is connected to the stdin-process of the bgstart process, it is inherited to the java process; so the terminator-process writes its “QUIT” to the stdin of the Java process!

Ok, the startup is quite clear now, so why the hell do we actually need 3 processes, 2 shell scripts and the java process? Well, let’s consider how the whole thing is shut down properly (and killing the Java process is not a good option).

The whole thing starts with the “stop” option (about line 570). First the TERM-signal is sent to the psmon-process, which then removes its own PID file and the file of the java process. Then the TERM-signal is sent to the terminator process, which writes the “QUIT” string to the java process. This string obviously tells the servlet engine to start a shutdown. Then the stop process waits up to 20 second and checks if the file vanishes. If after these 20 seconds the process is still there, it sends a TERM-signal to the java process (and assumes that this will do the rest and bring it down). But killing the java process isn’t really the friendly way, because the process will terminate immediately, leaving the shutdown process unfinished and the whole CQ/CRX in a inconsistent way. During next startup CRX will usually complain about a unclear shutdown; in most times you need to remove .lock files first before it really starts up properly.

Ok, finally the individual jobs of each process:

  • psmon: Restart in case of crashes
  • terminator: When issued, write “QUIT” to the java process
  • java process: actually do the work.

And for now a few tips:

Especially for large instances 20 seconds are often not enough to properly shutdown the whole CQ system. You want to increase this time when your system does not properly startup because either CRX performs some repair actions on startup; or it just refuses to startup because some .lock files are present. Then change the number in line 591 to something like this (I replaced the value 20 seconds by 300):

while [ -f "$CQ_LOGDIR/" ] && [ $COUNTER -lt 300 ]; do
  printf "."
  COUNTER=`expr $COUNTER + 1`
  sleep 1

This causes the script to wait up to 5 minutes, which should be enough for every CQ to shutdown. But if there are other problems, you have to wait 5 minutes until finally the kill happens.

When the java process was actually killed, it often leaves back some files in an inconsistent state. To stabilize the restart behaviour you may decide that you don’t want CQ to complain and stop during the regular startup; you just want CQ back in action asap and immediately start the recovery. You can add then the following lines to line 476 (before the “info” statement):

# remove all .lock files of a CQ crash/process kill
find $CQ_CONTEXT -name ".lock" | while read LOCKNAME; do
  warn "remove stale lock file $LOCKNAME"

Do this on your own risk, because you will never get to know if lockfiles have been left until you see CQ rebuilding its search and/or index files. But, a simple restart will your system bring back up (besides the case, when a crash killed things which cannot be recovered). With this proposed solution you have some data in the startup.log file, if (and which) stale lock files have been removed.

Another problem which often arises, is the number of open files. CQ5 and CRX as repository usually have many files open, a default installation has about 300 open files immediately after startup. If the number of requests increases and your repository grows, this number will grow too. At some point you will need to increase the maximum number of open files (in Unix speak: the value of ulimit -n).
By default this value is set to 1024 (CQ 5.2 and 5.2.1) by the serverctl script (until it is overriden by the value of CQ_MAX_OPEN_FILES in the wrapper script). Increasing this value by adjusting /etc/security/limits.conf (Redhat/Fedora) or via any other OS-preferred way does not help, the serverctl script always overrides this value. Applying this patch will fix this behaviour (to my readers from within Day: already reported in Bugzilla). A small patch for the “serverctl status” command will also print the configured value plus the current number of open files for the CQ process (also in Bugzilla).