Bootstrapping the CQ java process

In this article I will cover one special “feature” of the Unix variant of CQ: the start script “serverctl”, whose job is to start and stop the CQ Java process. For Windows there is the server.bat version, which is fairly straightforward. The serverctl script is more complex and has grown historically. I will explain the basics of the CQ process handling and of the Java arguments, and show some ways to cope with the open-files problem. If you use CQ without the prepackaged CQSE, you don’t use this script, therefore this posting may not be interesting to you.

The first half of the script actually has no influence on the process handling at all; it just presets parameters and parses the command line arguments. At about line 350 the interesting things start. By default the CQ process is started in the background; this start is triggered by the psmon process (actually the script starts a new incarnation of itself with the additional parameter “psmon”; in the process table it is visible with this “psmon” parameter, so I will just call it the “psmon process”).

This process performs the following important things:

  • Register a trap: when the TERM signal is received, it removes the PID file of its own process and the PID file of the Java process.
  • Start the terminator process (again a new instance of the script, this time with the “terminator” parameter) and attach its stdout file descriptor to the stdin of the bgstart process (same procedure here). This actually happens in an endless loop.

The terminator process is actually another instance of the script. It also registers a trap (reacting on some signals). The main action in this trap handler is that the string “QUIT” is written to stdout.

The bgstart process is yet another serverctl instance, which creates the cq.pid file with its own PID and then replaces itself completely with the final Java process. Because the stdout file handle of the terminator process is connected to the stdin of the bgstart process, it is inherited by the Java process; so the terminator process writes its “QUIT” to the stdin of the Java process!
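
To make the plumbing a bit more tangible, here is a heavily stripped-down sketch of the mechanism. This is not the actual serverctl code, only an illustration of the idea; the variable names ($CQ_LOGDIR, $CQ_JVM_OPTS, $CQ_JAR) and the PID file names are placeholders:

#!/bin/sh
# sketch only -- the psmon/terminator/bgstart plumbing of serverctl in a nutshell
case "$1" in
  psmon)
    # on TERM: remove our own PID file and the PID file of the java process
    trap 'rm -f "$CQ_LOGDIR/psmon.pid" "$CQ_LOGDIR/cq.pid"; exit 0' TERM
    echo $$ > "$CQ_LOGDIR/psmon.pid"
    while true; do
      # terminator's stdout is piped into bgstart's stdin (and thus into java's stdin);
      # the endless loop restarts the java process after a crash
      "$0" terminator | "$0" bgstart
    done
    ;;
  terminator)
    # on a signal: write QUIT to stdout, which ends up on the stdin of the java process
    trap 'echo QUIT; exit 0' TERM INT
    echo $$ > "$CQ_LOGDIR/terminator.pid"
    while true; do sleep 10 & wait $!; done
    ;;
  bgstart)
    # write the cq.pid file and replace this shell completely by the java process;
    # stdin (the pipe coming from the terminator) is inherited by java
    echo $$ > "$CQ_LOGDIR/cq.pid"
    exec java $CQ_JVM_OPTS -jar "$CQ_JAR"
    ;;
esac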

Ok, the startup is quite clear now. So why on earth do we actually need three processes, namely two instances of the shell script plus the Java process? Well, let’s look at how the whole thing is shut down properly (and killing the Java process is not a good option).

The whole thing starts with the “stop” option (at about line 570). First the TERM signal is sent to the psmon process, which then removes its own PID file and the cq.pid file of the Java process. Then the TERM signal is sent to the terminator process, which writes the “QUIT” string to the Java process. This string tells the servlet engine to start a shutdown. Then the stop process waits up to 20 seconds and checks whether the cq.pid file vanishes. If after these 20 seconds the process is still there, it sends a TERM signal to the Java process (and assumes that this will do the rest and bring it down). But killing the Java process isn’t really the friendly way, because the process will terminate immediately, leaving the shutdown unfinished and the whole CQ/CRX in an inconsistent state. During the next startup CRX will usually complain about an unclean shutdown; in most cases you need to remove .lock files first before it really starts up properly.

Ok, finally the individual jobs of each process:

  • psmon: Restart in case of crashes
  • terminator: when signalled, write “QUIT” to the stdin of the java process
  • java process: actually do the work.

And now a few tips:

Especially for large instances 20 seconds are often not enough to shut down the whole CQ system properly. You want to increase this time when your system does not start up properly afterwards, because either CRX performs some repair actions on startup or it simply refuses to start because some .lock files are present. Then change the value at about line 591 to something like this (I replaced the value of 20 seconds by 300):


COUNTER=0
while [ -f "$CQ_LOGDIR/cq.pid" ] && [ $COUNTER -lt 300 ]; do
  printf "."
  COUNTER=`expr $COUNTER + 1`
  sleep 1
done

This causes the script to wait up to 5 minutes, which should be enough for every CQ to shut down. But if there are other problems, you now have to wait 5 minutes until the kill finally happens.

When the Java process has actually been killed, it often leaves some files behind in an inconsistent state. To stabilize the restart behaviour you may decide that you don’t want CQ to complain and stop during the regular startup; you just want CQ back in action asap and the recovery to start immediately. You can then add the following lines at about line 476 (before the “info” statement):


# remove all .lock files of a CQ crash/process kill
find "$CQ_CONTEXT" -name ".lock" | while read LOCKNAME; do
  warn "remove stale lock file $LOCKNAME"
  rm -f "$LOCKNAME"
done

Do this at your own risk, because you will never get to know whether lock files have been left behind until you see CQ rebuilding its search and/or index files. But a simple restart will bring your system back up (except in the case where a crash broke things which cannot be recovered). With this proposed solution you at least have some data in the startup.log file telling you if (and which) stale lock files have been removed.

Another problem which often arises is the number of open files. CQ5 and CRX as its repository usually have many files open; a default installation has about 300 open files immediately after startup. If the number of requests increases and your repository grows, this number will grow too. At some point you will need to increase the maximum number of open files (in Unix speak: the value of ulimit -n).
By default this value is set to 1024 (CQ 5.2 and 5.2.1) by the serverctl script (until it is overridden by the value of CQ_MAX_OPEN_FILES in the start.sh wrapper script). Increasing this value by adjusting /etc/security/limits.conf (RedHat/Fedora) or via any other OS-preferred way does not help; the serverctl script always overrides this value. Applying this patch will fix this behaviour (to my readers from within Day: already reported in Bugzilla). A small patch for the “serverctl status” command will also print the configured value plus the current number of open files for the CQ process (also in Bugzilla).
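
To see where you stand, you can check the configured limit and the current consumption of the running CQ process. A small sketch (the cq.pid location is the one written by serverctl, see above; adjust the paths to your installation):

# limit of the shell which starts CQ
ulimit -n

# number of files the CQ java process currently has open
CQ_PID=`cat "$CQ_LOGDIR/cq.pid"`
ls /proc/$CQ_PID/fd | wc -l     # Linux
lsof -p $CQ_PID | wc -l         # alternative, needs lsof installed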

Application monitoring vs System monitoring

Recently I was asked how a CQ monitoring should be set up. The question was accompanied by a very short description of how the monitoring was supposed to look. There were some points like “monitoring the publishing by requesting pages”, “check if the java process is running” and “checking the free disk space”. Obviously they had just set up some new servers for this environment and thought that they needed to monitor some parameters.

As a first step I advised them to separate the topics “application monitoring” and “system monitoring”. One might wonder why I suggest making a strong division between these topics, so here is the background.

Standardization is one of the key topics in IT; everything that is standardized can be reused, can be exchanged for a compatible product, and finally lowers the cost. So IT operations teams tend to standardize as much as they can, because as an intermediate step towards lower cost, standardization allows automation.

Basic system monitoring is such a thing. Every computer has components which can be monitored in this way: disk health, CPU temperature, status of the power supply units, internal temperature. But also CPU utilization, free disk space, network connectivity or whether the system starts to swap. And many more. These are basic metrics which can be measured and monitored in a consistent and automatic way.

For these points it doesn’t matter whether the system runs a data warehouse application, a mail server or CQ. They are all the same, and the reaction is really comparable if one of these monitored things fails: if a disk is dead, it needs to be replaced (with not-so-old servers you can do this online and without service interruption). The procedure may differ from computer to computer, but the basic action is always the same: when the monitoring shows that a disk failed, look up the type of the failed disk, get a new one, go to the computer and replace it according to the guidelines of the computer manufacturer. That’s it. You can handle some thousand servers that way with only a few people.

Running applications isn’t standardized that way. One application requires a Windows server, others run, because of their history, only on big iron. One vendor offers performance guarantees only for Linux systems, and other vendors don’t care about the platform as long as they have a WebSphere Application Server as base. Some applications are designed to run centralized, other applications can be clustered. Some have good logging and messages you can use for diagnosis; others don’t have that, and error causes must be tracked down with system tools like truss or strace.
So applications are highly non-standardized and often need special skills and knowledge to operate them. Automation is a very hard job here, and there must be support from management to get every part of the organisation moving in the right direction.

(As a side note: in my former life before I joined Day I worked in a large IT operations organisation where every application was somehow non-standard; some less so, some completely outside any standard. IT tried its best to create some kind of standardization, but the businesses often didn’t care that much about it; also, developers didn’t know much about IT operations, so “but it works on my machine!!” and “just open the firewall so these 2 components can talk to each other” were often heard in early project stages.)

These applications also need completely different kinds of monitoring. The implementation of SAP monitoring looks different from the application monitoring of a web application. The actions to take in case of problems probably differ even more; and when it comes to investigating errors, the web application administrator cannot do anything on the SAP system. And vice versa.

So it’s advisable to separate the monitoring into 2 parts: The basic system monitoring and the application monitoring.

The system monitoring part can be done by one team for all servers. The application monitoring is too complex and too different, and the required actions so often need special know-how, that it must be adjustable to the needs of each application and its application administrators.

As a final conclusion: every time a computer system is set up, put it into the basic system monitoring, so that failing disks get replaced.
And when the application administrator deploys the application on it, the special monitoring stuff is installed on top.
Simply because the needs and skills it takes to react on monitored issues are very different.

Basic performance tuning: Caching

Many CQ installations I’ve seen start with the default configuration of CQ. This is in fact a good decision, because the default configuration can handle small and medium installations very well. Additionally you don’t have to maintain a bunch of configuration files and settings; and finally, most CQ hotfixes (which are delivered without a full QA cycle) are only tested against default installations.

So when you start with your project and you have a pristine CQ installation, the performance of both publishing and authoring instances is usually very good, the UI is responsive, and page load times are in the two-digit millisecond range. Great. Excellent.

When your site grows and the content authors start their work, you need to do your first performance and stress tests using numbers provided by the requirements (“the site must be able to handle 10000 concurrent requests per second with a maximum response time of 2 seconds”). You can either meet such requirements by throwing hardware at the problem (“we must use 6 publishers, each on a 4-core machine”) or you can try to optimize your site. Okay, let’s try optimization first.

Caching is the thing which comes to mind first. You can cache on several layers of the application, be it the application level (caches built into the application, like the output cache of CQ 3 and 4), the dispatcher cache (as described here in this blog), or on the user’s system (the browser cache). Each cache layer should decrease the number of requests hitting the remaining layers, so that in the end only those requests get through which cannot be answered from a cache but must be processed by CQ. Our goal is to move the files into the cache which is nearest to the end user; loading these files is then much faster than fetching them from a location 20 000 kilometers away.

(A system engineer may also be interested in this solution, because it offloads traffic from the internet connection. That leaves more capacity for other interesting things …)

If you start performance tuning from scratch, going for the low-hanging fruit is the way to go. So you start an iterative process which consists of the following steps:

  1. Identify requests which can be handled by a caching layer placed nearer to the end user.
  2. Identify actions which allow these requests to be cached in a cache next to the user.
  3. Perform these actions.
  4. Measure the results using appropriate tools.
  5. Start over from (1).

(For a broader view of performance tuning, see David Nuescheler’s post on the Day developer site.)

As an example I will go through this cycle on the authoring system. I start with a random look at the request.log, which may look like this:

09/Oct/2009:09:08:03 +0200 [8] -> GET /libs/wcm/content/welcome.html HTTP/1.1
09/Oct/2009:09:08:06 +0200 [8] <- 200 text/html; charset=utf-8 3016ms
09/Oct/2009:09:08:12 +0200 [9] -> GET / HTTP/1.1
09/Oct/2009:09:08:12 +0200 [9] <- 302 - 29ms
09/Oct/2009:09:08:12 +0200 [10] -> GET /index.html HTTP/1.1
09/Oct/2009:09:08:12 +0200 [10] <- 302 - 2ms
09/Oct/2009:09:08:12 +0200 [11] -> GET /libs/wcm/content/welcome.html HTTP/1.1
09/Oct/2009:09:08:13 +0200 [11] <- 200 text/html; charset=utf-8 826ms
09/Oct/2009:09:08:13 +0200 [12] -> GET /libs/wcm/welcome/resources/welcome.css HTTP/1.1
09/Oct/2009:09:08:13 +0200 [12] <- 200 text/css 4ms
09/Oct/2009:09:08:13 +0200 [13] -> GET /libs/wcm/welcome/resources/ico_siteadmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [14] -> GET /libs/wcm/welcome/resources/ico_misc.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [15] -> GET /libs/wcm/welcome/resources/ico_useradmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [15] <- 200 image/png 8ms
09/Oct/2009:09:08:13 +0200 [16] -> GET /libs/wcm/welcome/resources/ico_damadmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [16] <- 200 image/png 5ms
09/Oct/2009:09:08:13 +0200 [13] <- 200 image/png 17ms
09/Oct/2009:09:08:13 +0200 [14] <- 200 image/png 17ms
09/Oct/2009:09:08:13 +0200 [17] -> GET /libs/wcm/welcome/resources/welcome_bground.gif HTTP/1.1
09/Oct/2009:09:08:13 +0200 [17] <- 200 image/gif 3ms

Ok, it looks like some of these requests do not need to be handled by CQ at all: the PNG files and the CSS files. These files usually never change (or at least change very seldom, maybe on a deployment or when a hotfix is installed). For the usual daily work of a content author they can be assumed to be static, but we must of course provide a way for the authors to fetch a new version when one of them is updated. Ok, that was step 1: we want to cache the PNG and CSS files which are placed below /libs.
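
A quick way to get a feeling for such candidates is to count the requests below /libs, grouped by file extension. A rough one-liner (the location of request.log is an assumption, adjust it to your installation):

# which file types below /libs are requested how often?
grep 'GET /libs/' "$CQ_LOGDIR/request.log" | \
  awk '{ print $6 }' | sed 's/.*\.//' | sort | uniq -c | sort -rn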

Step 2: How can we cache these files? We don’t want to cache them within CQ (that wouldn’t bring any improvement), so the dispatcher cache and the browser cache remain. In this case I recommend caching them in the browser cache, for 2 reasons:

  • These files are requested more than once during a typical authoring session, so it makes sense to cache them directly in the browser.
  • The latency of the browser cache is way lower than the latency of any load over the network.

As an additional restriction which speaks against the dispatcher:

  • There are no flush agents for authoring mode, so we cannot use the dispatcher that easily. So in the case of tuning an authoring instance the dispatcher cache is not an option.

To make changes made to these files on the server visible to the user nevertheless, we can use the expiration feature of HTTP. It allows us to specify a time-to-live, which basically tells every interested party how long we consider this file up-to-date. When this time is reached, every party which cached the file should remove it from its cache and refetch it.
This isn’t the perfect solution, because a browser will drop the file from its cache and refetch it from time to time, although the file is still valid and up-to-date.
But it is still an improvement if the browser fetches these files every hour instead of twice a minute (whenever a page load occurs).

Our prognosis is that the browser of an authoring user won’t perform that many requests for these files anymore; this will improve the rendering performance of the page (the files are fetched from the fast browser cache instead of from the server), and additionally the load on CQ will decrease, because it doesn’t need to handle that many requests. Good for all parties.

Step 3: We implement this feature in the Apache webserver which we have placed in front of our CQ authoring system, and add the following statements:

<LocationMatch /libs>
ExpiresActive On
ExpiresByType image/png "access plus 1 hour"
ExpiresByType text/css "access plus 1 hour"
</LocationMatch>

Instead of relying on file extensions, we specify the expiration by MIME type in these rules (note that this requires Apache’s mod_expires module to be loaded). The files are considered up-to-date for an hour, so the browser will reload them every hour. This value should also be ok in case these files are changed once in a while. And if everything fails, the authoring users can simply clear their browser cache.
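
Whether the header actually arrives at the browser can be verified with a simple HTTP request against the authoring instance (the hostname is of course just an example):

# the response should now contain an Expires header and a matching Cache-Control: max-age=3600
curl -I http://author.example.com/libs/wcm/welcome/resources/welcome.css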

Step 4: We measure the effect of our changes using 2 different strategies. First we observe the request.log again and check whether these requests still show up. If the server is already heavily loaded, we can additionally check for a decreasing load and improved response times for the remaining requests. As a second option we take a simple use case of an authoring user and run it with Firefox’ Firebug extension enabled. This plugin can visualize how and when the parts of a page are loaded, and displays the response times quite exactly. You should now see that the number of files requested over the network has decreased and that loading a page with all its embedded objects is faster than before.

So with a quick and easy-to-perform action you have decreased the page load times. When I added expiration headers to a number of static images, JavaScript and CSS files on a publishing instance, the number of requests which went over the wire dropped to about 50%, and the page load times also decreased, so that even during a stress test the site still performed well. Of course, dynamic parts must be handled by their respective systems, but if we can offload requests from CQ, we should do so.

So as a conclusion: some very basic changes to the system (a few adjustments to the Apache configuration) may increase the speed of your site (publishing and authoring) dramatically. Changes like the ones described are not invasive to the system and are highly adjustable to the specific needs and requirements of your application.

Reporting application problems

(I write this blog article with a certain background: in the Daycare ticketing system the support often needs to ask for additional information before it can even start an initial analysis of the reported issue. This is a time-consuming task and increases the time-to-fix. So this is a help for my colleagues at Day support, but it should also be applicable to the contact with most enterprise support lines.)

Writing a bug report is hard work. Many issue reporters think that the people who are responsible for the application just try to avoid fixing a bug, and therefore ask questions and demand information which is hard to deliver and which seems absolutely obvious to you; that these people don’t want to admit that their product has issues.

But these questions can often be answered easily if you are well prepared.

Usually developers (and the support people who work as the first line for them) ask the following questions:

What software are you using, which versions, which additional fixes?
People often assume that this information is already known to the support (especially if you’re dealing with an enterprise support line); but these support lines often don’t track the software versions of their customers; and who knows, maybe you are reporting an issue with a new version which is currently only installed on your development systems.

Providing this information by default when opening an issue report helps the support to help you more quickly. There’s no need to deal further with version information and ask for installed hotfix versions; at least one round of question and answer less.

A point for all software developers: provide a facility to get all this information without wrestling with the package databases or registry of your systems. Keep this information automatically up-to-date when installing additional packages, fixes or enhancements.

What’s the impact of the reported issue?
Provide the impact of the problem, so the support can estimate the importance of the issue. A report about a wrongly documented feature gets a different priority on the developer’s todo list than an hourly crashing ERP system.
Also describe the audience which is affected by the issue. A non-working feature which is offered as a vital part of your website is clearly more important than the same feature being non-functional for a small group of people, because for the latter it is probably easier to provide a workaround.
(It will probably become super-important if this small group is the management, but that’s another topic …)

When was the issue spotted first?
This information may help to correlate the issue with other events; often problems only become visible under certain circumstances which were not present at the beginning. This may be a system update (operating system, JVM, database, …), changed settings in the application itself or just heavier use of the system (more data, more users, higher peak usage). All these factors may increase the chance that certain, yet unknown and unspotted problems become visible and harm your application.
(That’s the background of the famous quote “never change a running system”.)

It’s your task as an issue reporter to provide this information to the support. It helps the support to focus on the impact of such changes, which very often reduces the amount of investigation dramatically.

So, for example, if you have recently updated your Sun JVM from 1.5.0.8 to 1.5.0.11 for security reasons and suddenly encounter spurious crashes, the developers may focus on the changes introduced by this JVM update and analyse their impact on the application. Without this information you probably have to go through a long and painful analysis phase in which the developers ask for all kinds of dumps, JVM instrumentation and so on.

Is the problem reproducible?
This is the question a developer is most interested in. If an issue can be reproduced, it can be fixed, because the developer can analyze the issue, understand the problem and then solve it, all without too much trial and error just to see the problem at all.
If an issue cannot be reproduced, a lot of information is missing. Maybe the problem occurs under conditions which only exist on your particular system or with your particular data. Trying to reproduce the issue on any other system is then hard or impossible.
So this is one of the most important tasks of an issue reporter: trying to provide a reproducible test case. If you are able to reproduce the issue, describe all the prerequisites and the steps to actually reproduce it. Be it a step-by-step documentation or a little screencast, any appropriate format is welcome.

In the case of Day CQ the playground/geometrixx application of a plain CQ installation can be used as the basis for test cases. So just install a plain CQ and make as few changes as possible to reproduce the problem.

If you can reproduce your problem on a plain CQ installation, you make the task of fixing your issue much easier for Day. Time-consuming analysis and assumptions about a lot of parameters can then be avoided, and the developers may head directly for the issue itself.

Often you cannot reproduce the issue you want to report; either because you don’t know the issue exactly (“my system just crashes”), or because it is specific to a certain environment (“the crash only happens under heavy load; we couldn’t reproduce it in stress tests yet”). Then you need to provide as much information as possible.

Additional information
Attach all available information (ok, not really _all_ information; only what sounds useful, e.g. logfiles containing application-specific logs, system dumps, thread dumps for Java applications, …) to your issue report.

If some special information is missing, the support will ask for it. But if you provide a certain standard set of information (depending on your application), this will be sufficient in 90% of the cases.

For a Day CQ installation this information is the following:

  • error.log of CQ
  • error.log of CRX
  • in case of performance problems: request.log, garbage collection log
  • in case of performance problems and system lockups: thread dumps
  • in case of performance problems and out-of-memory exceptions: thread dumps and heap dumps (see the sketch below for how to take them)
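
If you don’t have thread dumps or heap dumps at hand yet, the standard JDK tools can produce them. A rough sketch (the cq.pid file is the one written by serverctl; jstack and jmap are part of the Sun JDK):

CQ_PID=`cat "$CQ_LOGDIR/cq.pid"`

# thread dump: written to the stdout log of the java process
kill -QUIT $CQ_PID
# or, with a recent Sun JDK:
jstack $CQ_PID > threaddump.txt

# heap dump (Sun JDK 6; may pause the JVM for a noticeable time)
jmap -dump:format=b,file=heap.hprof $CQ_PID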

Conclusion

For all these questions there are good reasons why they are asked. I hope I have shown you some of the background needed to understand these reasons.

So providing the right information right from the start will reduce the time until you get support which actually helps you resolve your issue; or the support can at least try to provide useful tips which may help to establish workarounds. So in the long term it helps both you as an issue reporter and the support.

A good issue report

(inspired by How To Ask Questions The Smart Way by Eric S. Raymond)

In the last months I have encountered several situations in which people brought up issues like “the site isn’t working” or “the site is slow”, without any further information. If I am responsible for the thing in question, this leaves me with a hard task: I am expected to fix a problem for which I don’t have any information. Sometimes the problem doesn’t even exist, because a situation is perceived as a problem which doesn’t exist on my side, but either on the system of the person reporting it or somewhere in between. But we don’t know.

People who get such reports tend to have different strategies. One strategy is to simply reject such issues because “I cannot reproduce it on my system; the website is responding quickly and behaves fine” and throw them away (move the complaint email to the trash or close the ticket in the issue system). Others just ignore them (for the same reason), but don’t throw the reporting message into the trash. A third approach is to request more information about the issue. That would be a good approach, but either the reporters do not react anymore (because they are just busy), the problem is gone in the meantime (for whatever reason), or they cannot provide the requested information anymore. Also not a satisfying solution.

So the only remaining solution is to get people to provide the required information with the initial issue report. If you ask people to describe their problem closely and in detail, they are usually happy to provide this information, because they feel that somebody really wants to solve their problem. But because they are not aware that (and which) information is needed for the resolution of an issue, they tend to provide no information at all.

So the goal is to have a list of things which have to be provided by the reporter of an issue along with the issue. In the end an issue report would look like this:

“At about 1:40pm on September 23rd 2009 I requested the page http://www.abc.foo/a/b.html; I received only a damaged page, with some pictures missing. When I tried to login using my credentials (my username is hbt85), I received an internal server error page; I used the Firefox browser installed on my corporate computer.”

Although this report is very brief (and probably didn’t take much time to write), it contains valuable information, so a system administrator can immediately start to look for the problem and has a realistic chance of finding traces of the described issue (in the logfiles, in a system dump or in the application monitoring). It contains the following important information:

  • Who is the reporter? (Not only the email address or the full name of the user, but also the username in the affected system.)
  • What system is affected? (given by the exact hostname)
  • When does the issue occur? (time and date)
  • What has the reporter done and what were the effects of it?

So the most time-consuming task of every support is to qualify every incoming request to such a level that a qualified person can take a look at the system to identify and fix the issue. While qualifying such problems it very often turns out that the cause of the problem lies on the user side, be it missing training on the system resulting in wrong usage, missing or incorrect documentation, or just errors on the user’s side. In our example it could be that the user just used a wrong URL; he should have used “http://www.abc.foo/a-new/b.html”, but he missed the mail announcing this change.

So a major job of every support organisation is to have a prepared, up-to-date list of the information which is needed to resolve an issue. So if you are in a support organisation, provide a list of the pieces of information which must be provided by an issue reporter. And if you’re an experienced user, want to report an issue and really want it fixed, provide as much information as possible. There is no “too much information” regarding a specific problem.

Meta: new Job, Ignite 2009

Although the times seem to be hard, with the financial crisis which hit many companies and decreased IT budgets, I decided to leave my old job. Since October 1st I have been employed as a Solution Architect at Day Software and will be working in Frankfurt am Main (Germany). At the moment I am in Basel at the Barfüsserplatz office, getting to know the people (hey, you really rock!) and the products. So I will also be able to cover the latest versions in this blog.

I will also visit the customer summit in Zurich this year (Thursday only), meeting people and learning about our products 🙂 So see you there!

Traversing the content hierarchy

When you play around with websites, you often get a good feeling for how much “engineering” work and thought was put into a site. Think of things like SQL injection and shell escape injection, which were a problem a few years ago. If you encounter a site today which is vulnerable to such a problem, it’s either a problem of the budget (which is a lame excuse, because modern frameworks avoid such problems) or of the skill of the developers. An up-to-date site isn’t affected by such problems anymore.

A problem which is clearly visible, but often not known to application developers and architects, is the content structure which is exposed to the user through the URL. Consider the following small example of a simple content structure:

/content/brand/en/home

for the startpage of a CQ-based website. The “home”-handle is the startpage and is thus called via

www.example.com/content/brand/en/home.html

So, where’s the problem? Well, most templates provide a kind of HTML representation of their content. So let’s try

www.example.com/content/brand/en.html

Maybe a structure handle such as the language node (which is often just used to differentiate between languages) also provides an HTML representation of its content; it could, for example, simply render its child nodes as a bulleted list.

So what then? Does it do any harm if you reveal that you provide Chinese content besides the English one? No, it doesn’t. Most times it doesn’t. But what if you already have fresh content ready, but not yet linked? It would appear in such a list. Or if you have “hidden content”, functions which are known only to a small group of people? Things which aren’t secured by authentication and authorization. Suddenly someone has found your private data and could make use of it.

The trash can for functionality is often a folder named “tools”; developers tend to place everything there which doesn’t fit well into any other category. So you can find contact forms, search functionality and other stuff there. So what happens if you call

www.example.com/content/brand/en/home/tools.html

Does it also show your unused/crappy/new functions, which aren’t used in the website but are still there? Because for convenience some developer thought it would be cool to have all tools listed without major hassle (1 bookmark instead of 10). Bad idea: you just showed all your available tools to someone who shouldn’t see them.

So check your site to make sure that structure nodes, which are only used to structure your content, either cannot be rendered at all or don’t reveal any information which could be useful for an attacker. Either return an empty page or (better) return the HTTP status code 403 (“Forbidden”). Don’t reveal data when it isn’t necessary. A well-engineered site also takes care of such “attacks” and doesn’t reveal any data which could be of use to a potential attacker.
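
If the templates of the structure nodes cannot be changed easily, the webserver in front of the publishing instance can also refuse such requests. A hedged example for Apache 2.2 (the paths are only placeholders for your own structure handles):

# deny direct rendering of pure structure nodes (example paths only)
<LocationMatch "^/content/brand/(en|en/home/tools)\.html$">
Order allow,deny
Deny from all
</LocationMatch>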

I’ve already done such tests on several CQ-based websites and found (besides some other things) a monitoring page (containing version information of the used libraries) and also a hidden web special which was dedicated to a member of the web team who was heading for another location (hi, Katrin!). All of this information was publicly viewable (on a major corporate website!) just by playing around with path names and then following links.

Disk usage

Providing enough free disk space is one of the most urgent tasks of every system administrator. Applications often behave strangely when the disk runs out of free space, and sometimes data gets corrupted if the application is badly written.

Under Unix the 2 tools to determine the usage of the disk are “du” (disk usage) and “df” (disk free). With “du” you can determine the amount of space a certain directory (with the files and subdirectories in it) consumes. “df” shows on a global level the amount of used and free disk space per partition.

Suppose you give the same directory (which is also a mountpoint) to both “du” and “df” as a parameter. This directory contains a full CQ application with content and versions. You will probably get different results. When we did this, “df” showed about 570 gigabytes of used disk space, but “du” claimed that only 480 gigabytes were used. We checked for hidden directories, open files, snapshots and other things, but the difference of about 90 gigabytes remained.

This phenomenon can be explained quite easily. “du” accumulates the sizes of the files: if a file is 120 bytes in size, it adds 120 to the sum. “df” behaves differently; it counts block-wise, a block being the smallest allocation unit of a Unix filesystem (typically between 512 bytes and a few kilobytes). Because a block can belong to only one file, a 120-byte file in a filesystem with 512-byte blocks still occupies a full block, leaving 392 bytes of that block unused.

Usually this behaviour is not apparent, because the number of files is rather small (a few thousand to some ten thousand) and the files are large, so the unused parts amount to at most 1 percent of the whole file size. But if you have a CQ ContentBus with several hundred thousand files (content plus versions), many of them small, this overhead can grow to a level where you start asking your system administrator where the storage has gone.
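
If you want to see how much of the difference is caused by this per-file overhead, GNU find can report both the logical size and the allocated blocks of every file. A rough sketch (the path is a placeholder):

# compare logical file sizes with the actually allocated space
# (GNU find: %s = size in bytes, %b = allocated 512-byte blocks)
find /path/to/contentbus -type f -printf '%s %b\n' | \
  awk '{ bytes += $1; alloc += $2 * 512 }
       END { printf "file sizes: %d MB, allocated: %d MB\n", bytes/1024/1024, alloc/1024/1024 }'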

So, dear system administrator, there’s no need to move your private backup off the server; just tell the business that their unique content structure needs a lot of additional disk space. 🙂

User administration on multi-client-installations

Developing an application for a multi-client installation isn’t only a technical or engineering quest, but also raises some questions which affect administration and organisational processes.

To ease administration, the user accounts in CQ are often organized in a hierarchy, so that users which are placed higher in the hierarchy can administrate the users below them in the hierarchy tree. Using this mechanism an administrator can easily delegate the administration of certain users to other users, which can then do administrative work for “their” users.

The problem arises when a user needs rights in 2 applications within the same CQ instance and every application is supposed to have its own “application administrator” (a child node of the superuser). Then this kind of administration is no longer possible: in a tree you cannot place both application administrator A and application administrator B above a shared user C without one of them also being a parent (or ancestor) of the other.

I assume that creating separate accounts for the same person in different applications isn’t feasible. That would be the easiest solution from an engineering point of view, but it contradicts the ongoing move not to create a new user/password pair for each application and each user (single sign-on).

This problem imposes the burden of user administration (e.g. assigning users to groups, resetting passwords) on the superuser, because the superuser is the only user which is always (either transitively or directly) a parent of any user. (A non-CQ-based solution would be to handle user-related changes like password set/reset and group assignment outside of CQ and then synchronize these data into CQ, e.g. by using an LDAP-based directory system.)

ACLs, access to templates and workflows should be assigned only via groups and roles, because these can be created per application. So if an application is currently based on a user hierarchy and individual user rights, it is hard to add a new application using the same users.

So one must make sure that all assignments are based only on groups and roles which are created per application. Assigning individual rights to a single user isn’t the way to go.

Being a good citizen in multi-client-installations

Working in a group requires a kind of discipline which some people are not used to. I remember a colleague who always complained about his office mate, who used to shout into the phone even in normal conversations. If people work together and share resources, everyone is expected to be cooperative and not to trash the work and morale of their team members.

The same applies to applications: if they share resources (because they may run on the same machine), they should release resources when they are no longer needed, and should only claim resources if they actually need them. Consuming all CPU because “the developer was too lazy to develop a decent algorithm and just chose the brute-force solution” isn’t considered good behaviour. It gets even harder if these applications are contained within one process, so that a process crash not only affects application A, but also B and C. And then it doesn’t matter that those are well thought out and perfectly developed.

So if you plan to deploy more than one application to a single CQ instance, you should make sure that the developers were aware of this condition and had it in mind. The application no longer controls the heap usage on its own (on top of the heap consumption of CQ itself), but must share it with other applications. It must be programmed with stability and robustness in mind, because unknown structures and services may change assumptions about timing and sizes. And yes, a restart also affects the others, so restarting because of application problems isn’t a good thing.

In general an application should never require a restart of the whole JVM; only necessary changes to JVM parameters should justify one. All other application-specific settings should be changeable through special configuration templates which are evaluated at runtime, so that changes are picked up immediately. This even reduces the amount of work for the system administrator, because changing such values can be delegated to special users using the ACL mechanism.