“Concurrent users” and performance tests

A few years ago, when I was still working in application management of a large website, we often had the case that the system was reported to be slow. Whenever we looked into the system with our tooling, we did not find anything useful, and we weren’t able to explain this slowness. We had logs which confirmed the slowness, but no apparent reason for it. Sadly, the definition of performance metrics was just … well, forgotten. (I once saw the performance requirement section in the huge functional specification doc: 3 sentences vs. 200 pages of other requirements.)

It was a pretty large system, and rumor had it that it was one of the biggest installations of this application. So we approached the vendor and asked how many parallel users were supported on the system. Their standard answer “30” (by the way: that’s still the number on their website, although they have rewritten the software from scratch since then) wasn’t that satisfying, because they didn’t provide any way to actually measure this value on the production system.

The situation then improved a bit, got worse again, improved, … and so on. We had some escalations in the meantime and also ran for months in task-force mode to fight this and other performance problems. Until I finally got mad, because we weren’t able to actually measure how the system was used. So I started to define the meaning of “concurrent users” for myself: “2 users are considered concurrent when for each one a request is logged in the same timeframe of 15 minutes”. I wrote a small perl script which ran through the web server logs and calculated these numbers for me. As a result I had 4 numbers of concurrent users per hour. By far not exact, but reasonable to the extent that we had some numbers.
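To illustrate the idea behind that script, here is the same counting logic as a minimal Java sketch (not the original perl; the log format with a date, time and user field per line is an assumption): bucket each request into its 15-minute window and count the distinct users per bucket.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Sketch of the counting logic: distinct users per 15-minute window.
// Assumes log lines like: 2012-06-01 10:03:17 jdoe GET /content/page.html
public class ConcurrentUsers {
  public static void main(String[] args) throws Exception {
    Map<String, Set<String>> usersPerWindow = new TreeMap<String, Set<String>>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split(" ");
      String date = fields[0];
      String time = fields[1];
      String user = fields[2];
      // truncate the minute to the start of its 15-minute window: 10:03 -> 10:00
      int minute = Integer.parseInt(time.substring(3, 5));
      String window = String.format("%s %s:%02d", date, time.substring(0, 2), (minute / 15) * 15);
      Set<String> users = usersPerWindow.get(window);
      if (users == null) {
        users = new HashSet<String>();
        usersPerWindow.put(window, users);
      }
      users.add(user);
    }
    in.close();
    for (Map.Entry<String, Set<String>> e : usersPerWindow.entrySet()) {
      System.out.println(e.getKey() + ": " + e.getValue().size() + " concurrent users");
    }
  }
}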

And at that time I also learned that managers are not interested in the definition of a term, as long as they think they know what it means. So I actually counted a user as concurrent when she logged in once and then had an auto-refresh of a page every 10 minutes configured in her web browser. But hey, no one questioned my definition, and I think the scripts with that built-in definition are still in use today.

But now we were able to actually compare the reported performance problems against these numbers. And we found out that they were sometimes related (my reports showed that we had 60 concurrent users while the system was reported to be slow), but often not (no performance problems reported while my reports showed 80 concurrent users; and also performance problems with only 30 reported users). So, this definition was actually useless… Or maybe the performance isn’t related to the “concurrent users” at all? I already suspected that, but wasn’t able to dig deeper and improve the definition and the scripts.

(Oh, before you think “count the active sessions on the system, dude!”: That system didn’t have any server-side state, therefore no sessions. And the above-mentioned case of a logged-in user auto-reloading the start page would produce the same result: she has an active session. So be careful.)

So, whenever you get a definition like “the system should support X concurrent users with a maximum response time of 2 seconds”, question “concurrent”. Define the activities of these users, build performance tests accordingly, and validate the performance. Have tools to actually measure the metrics of a system hammered with “X concurrent users” while these performance tests run. Apply the same tooling to production. If the metrics deliver the same values: cool, your tests were good. If not, something’s wrong: either your tests or reality…

So as a bottom line: Non-functional requirements such as performance should be defined with the same attention as functional requirements. In any other case you will run into problems.

Meta: upcoming events 2012

A new posting where I don’t tell you something about CQ 5.5, but about some upcoming events. These might be relevant for you, because both are about CQ5.

First, there is the adaptTo() conference in Berlin, Germany, from September 26th to 28th. Last year this conference was a huge success, and I got very positive feedback about it from colleagues and coworkers. Some people from CQ5 engineering will be attending this 3-day conference as well, so expect cool technical topics and a lot of Q&A sessions (official and unofficial ones). As usual, the get-together will be the most interesting part. So if you are working in the CQ5 business and are interested in more technical topics: this is the event especially for you!
Sadly I cannot make it this year, because my schedule is too tight.

And as a second event there is “Evolve12” in San Diego, California (from Oct 15th to 17th). As this is the first edition of this conference, I don’t have too much information about its character, but with Kevin Cochrane and David Nüscheler on the list of speakers, it will be a little bit more strategic and marketing-oriented 🙂 Although 2 of the 3 tracks claim to be technical … well, let’s see.
This event is accompanied by a special request on the Google Groups list (which is, by the way, a must for every CQ5 developer, check http://groups.google.com/group/day-communique?hl=en): Everybody should nominate her personal “TOP CQ Google Group Contributor”, and I encourage you to do so.
Currently I don’t plan to attend this conference either, as it would mean being away for at least a week (and again, my work schedule is tight). But let’s see…

Creating MBeans for your CQ5 application

JMX is the de-facto standard for monitoring Java processes and applications. A lot of monitoring systems have the ability to consume this data.

By default CQ 5.5 has a number of MBeans which offer runtime information about internal state. The most interesting ones are the MBeans about the repository state and the replication MBeans. But it isn’t hard to create your own MBeans, so you can provide information about the internal state of your application to the monitoring system; or you can monitor resources which are critical to your application and use case.

In CQ5 we are working in an OSGi environment, so we will use one of my favorite patterns, the OSGi whiteboard pattern. We will use the JMX Whiteboard bundle of the Apache Aries project to register services to JMX. That implementation is also very short and understandable and shows the power of the whiteboard pattern. (I already had a short blog entry on this last year.)

In this example I want to demonstrate it on an already existing counter, the total number of requests handled by Sling. It requires CQ 5.5, where the JMX Whiteboard bundle is already deployed by default; but if you install the JMX Whiteboard bundle yourself, you can also use older versions of CQ5.

And it goes like this:

1. Build JMX interface

That’s a regular Java interface whose name ends with the string “MBean”. This interface provides the methods and getters/setters you want to expose to JMX.

package sample.jmx;

public interface RequestCounterMBean {
  int getRequestCount();
}

2. Add the implementation class

As the next step we have the implementation class. To make it easy, I implement the counter as a servlet filter which just increments a counter.

package sample.jmx;

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Properties;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;

// @Service without arguments registers this component under all implemented interfaces
@Service
@Component(immediate=true,metatype=true)
@Properties({
  @Property(name="jmx.objectname",value="sample.jmx:id='requestCounter'"),
  @Property(name="pattern",value="/.*")
})
public class RequestCounter implements RequestCounterMBean, Filter{

  private int count = 0; // note: increments are not atomic; good enough for a demo counter

  public int getRequestCount() {
    return count;
  }

  public void destroy() {
    // nothing to do
  }

  public void doFilter(ServletRequest request, ServletResponse response, 
      FilterChain chain) throws IOException, ServletException {
    count++;
    chain.doFilter(request, response);
  }

  public void init(FilterConfig arg0) throws ServletException {
    // nothing to do
  }
}

3. That’s it.

The magic lies in the annotations: @Service without any other parameter registers this service as an implementation of all interfaces the class implements, in this case as a servlet Filter and as an MBean. So when this service is started, it is registered as an implementation of the RequestCounterMBean interface; as this interface name ends with “MBean”, it is tracked by the JMX Whiteboard service and then registered to all available MBean servers under the object name “sample.jmx:id='requestCounter'”.

Now you should be able to export any interesting information of your CQ5 application as an MBean. Of course there are other ways to export data via JMX, but I consider this one a very simple approach which is very easy to understand and use.
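For comparison, this is roughly what the whiteboard bundle saves you from: a sketch of manual registration via the plain JDK API (in real code you would also have to unregister the MBean when the service is deactivated).

import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ManualRegistration {
  public static void register(RequestCounter counter) throws Exception {
    // the JDK's built-in MBean server; the whiteboard additionally handles
    // multiple MBean servers and the unregistration for you
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    server.registerMBean(counter, new ObjectName("sample.jmx:id=requestCounter"));
  }
}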

JConsole showing our own MBean

Testing with the “admin” user

One of the most common patterns when working with CQ5 in a demo or development environment is the use of the “admin” account for all kinds of purposes, be it content creation, administration or testing. The best example is David Nüscheler, who regularly uses the admin account in his demos 🙂

This account has a big advantage when it comes to demos: you don’t have any limits, and you can change configurations on the fly without logging in twice with different accounts. Great deal.

But I also see many developers and testers who execute their tests using the “admin” account. While for the purely functional stuff it doesn’t seem that important which user is used (“I only check if I get the right result, and this result is not influenced by any ACL!”), there are some hidden drawbacks which come up every now and then.

First, the admin user isn’t restricted by any ACLs, therefore any code within CQ5 which has an admin session tied to the request can read and write anywhere in the repository. If you have a team of not-so-experienced developers, you might get code which does some things not in the way you expect. For example, it tries to update tags when you modify a page. In a developer test (developers usually work with the admin user) this isn’t a problem. A tester working as admin also doesn’t see any issues. Until your code hits production: then an editor who updates content will face an internal server error, because she isn’t allowed to modify tags. (The experienced developer will use an admin session for this, which doesn’t show this error, but the design approach is bad as well …)

And as a second aspect, code accessing the repository using an admin session has a different runtime performance than a session of other users. This is mostly caused by direct shortcuts in the code and by the avoidance of ACL checks. This effect is mostly visible in performance tests, when requests issued with the admin account are faster than requests made by a regular user account. So if you do performance tests, build a set of regular users and use them instead of the admin account; a minimal sketch of the idea follows below.
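A sketch of that comparison (host, port, URL and the user names are placeholders; a real performance test would of course use a proper load testing tool): time the same request for a handful of regular accounts and compare against the admin account.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class UserTimingCheck {
  public static void main(String[] args) throws Exception {
    // "editor1"/"editor2" are assumed test accounts with editor-like ACLs
    String[] users = { "admin:admin", "editor1:password", "editor2:password" };
    for (String credentials : users) {
      String auth = Base64.getEncoder().encodeToString(credentials.getBytes("UTF-8"));
      long start = System.currentTimeMillis();
      HttpURLConnection conn = (HttpURLConnection)
          new URL("http://localhost:4502/content/mysite.html").openConnection();
      conn.setRequestProperty("Authorization", "Basic " + auth);
      // drain the body so the measurement covers the complete response
      InputStream body = conn.getInputStream();
      while (body.read() != -1) { /* skip */ }
      body.close();
      System.out.println(credentials.split(":")[0] + ": "
          + (System.currentTimeMillis() - start) + " ms");
    }
  }
}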

So, if you test, don’t test only with “admin”. Use different users as well. Because out there most editors work with a limited account.

CQ5 requesting itself

I recently came across a quite common developer behaviour again: Check how a certain action can be performed in the UI, and then reproduce it in code.

For example: How do I start a certain workflow in CQ5 via code? The easiest way to find this out is to perform this action within the UI and then replay the request. That means sending a POST request to localhost:4502/etc/workflows/instances with a number of POST parameters. A developer might then start thinking about using httpclient to send an HTTP request to CQ5 itself to start that workflow. But because this request needs authentication, the developer requires some user with enough permissions to start the workflow; in the worst case “admin” is hardcoded as password 😦

This approach has several severe drawbacks:

  1. The port might not be 4502 …
  2. A password must be provided, because on authoring all resources are protected by ACLs.
  3. You need to parse the return code to check if something went wrong.

In the end that’s quite a lot of work to do. Just because the developer wasn’t aware of the WorkflowService. The same applies to many other features of the UI.

So one thing is quite important to know: Every feature of the UI can also be accessed by other means, without remote-controlling CQ5 over HTTP. In most (all?) cases there are services available which allow direct access to that functionality via a regular Java API. So if you are a developer looking to access some piece of functionality, look up the API docs and the Felix console. You’ll certainly find that feature, especially if it’s as prominent as the WorkflowService. If you are an admin, check your installation and deployment guidelines and make sure that none of the 3 items above are required.
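As a sketch, starting a workflow via the WorkflowService looks roughly like this (the model path points to the standard request-for-activation model; adjust it to the workflow you actually want to start):

import javax.jcr.Session;

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Reference;

import com.day.cq.workflow.WorkflowService;
import com.day.cq.workflow.WorkflowSession;
import com.day.cq.workflow.exec.WorkflowData;
import com.day.cq.workflow.model.WorkflowModel;

@Component
public class WorkflowStarter {

  @Reference
  private WorkflowService workflowService;

  // starts a workflow for the given page, using the permissions of the passed session
  public void startActivation(Session session, String pagePath) throws Exception {
    WorkflowSession wfSession = workflowService.getWorkflowSession(session);
    WorkflowModel model = wfSession.getModel(
        "/etc/workflow/models/request_for_activation/jcr:content/model");
    WorkflowData data = wfSession.newWorkflowData("JCR_PATH", pagePath);
    wfSession.startWorkflow(model, data);
  }
}

No port, no password, no return code parsing; the permissions come from the session you pass in.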

But … there is one (and only one) valid requirement to send a request from CQ5 to itself: If you want to get hold of a rendered resource. For example, you need to build a ZIP archive out of the pages beneath a page. For this you need these pages rendered. One way would be to reconstruct the way Sling request processing works. The other and easier way is to just do an HTTP request and do everything like it is supposed to work. But then you need to provide username/password and the port… Since CQ 5.4, however, there is a solution for this problem. You can use the SlingRequestProcessor service and pass in an HttpServletRequest and an HttpServletResponse object, plus a ResourceResolver (which is already authenticated!). You need to provide your own mock implementations of request and response, plus a method which returns the body of the response.
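A sketch of that approach; MockSlingRequest and MockSlingResponse are the hypothetical mock implementations mentioned above, which you have to write yourself (the response mock buffers its output so you can read the body afterwards):

import javax.servlet.http.HttpServletRequest;

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Reference;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.engine.SlingRequestProcessor;

@Component
public class InternalRenderer {

  @Reference
  private SlingRequestProcessor requestProcessor;

  // renders the resource at "path" without any HTTP round-trip;
  // the resolver carries the authentication, so no username/password/port is needed
  public String render(ResourceResolver resolver, String path) throws Exception {
    HttpServletRequest request = new MockSlingRequest("GET", path); // hypothetical mock
    MockSlingResponse response = new MockSlingResponse();           // hypothetical mock
    requestProcessor.processRequest(request, response, resolver);
    return response.getBodyAsString(); // method of the hypothetical response mock
  }
}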

Disk space consumption

Disk space is cheap today; my favorite dealer currently sells 2-terabyte disks for less than 100 Euro. Of course this price does not cover redundancy and backup, and of course enterprises don’t buy consumer hardware… Still, even for them 1 terabyte of disk space should be affordable. Well, sometimes you don’t get that impression. You have to fight for funding for each and every gigabyte. So, sometimes disk consumption is a real problem…

So, let’s discuss some causes for disk consumption with CRX and the TarPM. The storage consists of 4 parts:

  1. the tar files
  2. the journal
  3. the index
  4. the datastore

The tar files

The TarPM is an append-only storage, therefore every action done on the repository adds data to the tar files, even deletions; disk space is only reclaimed when the TarOptimizer runs (see the documentation on the docs).

The journal

The journal is mainly relevant for clustering. In the default configuration it consumes 100 megabytes for a single version file, so this shouldn’t be a problem in any case (even if you don’t do clustering, you can neglect the additional disk space consumption).

The index

The index is Apache Lucene-based and has an add-and-merge strategy: while additional segments are added as dedicated files, these segments are merged into the main index in times of low activity. And when content is removed (and therefore shouldn’t be available in the index anymore), the respective parts of the index are deleted. So the index shouldn’t give you any problems. In case the index grows too large, you can tweak the indexing.xml, but for normal use cases that shouldn’t be necessary.

The datastore

Like the TarPM, the datastore (see the docs) is an append-only storage in the first place. Large binaries are placed there and referenced from the TarPM; but there are no references from the datastore back to the TarPM, so when a node is removed in the TarPM, the reference is removed, but not the datastore object. That’s because it might still be referenced from other nodes, and at this point this is not known. So we might have objects in the datastore which are not referenced anymore. To clean these up, the datastore garbage collection checks all references to the datastore and removes the objects which are not needed anymore.
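In plain Jackrabbit 2.x (which CRX is based on; CRX also offers this through its console and JMX, so the exact entry point in your installation may differ) the garbage collection API looks like this sketch:

import org.apache.jackrabbit.api.management.DataStoreGarbageCollector;
import org.apache.jackrabbit.core.RepositoryImpl;

public class DatastoreCleanup {
  public static void collect(RepositoryImpl repository) throws Exception {
    DataStoreGarbageCollector gc = repository.createDataStoreGarbageCollector();
    try {
      gc.mark();  // scan the repository and mark all referenced datastore objects
      gc.sweep(); // delete everything that was not marked
    } finally {
      gc.close();
    }
  }
}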

What you should be aware of

If you need to reclaim storage, consider this:

  • Versioning: Just removing pages is often not sufficient to reclaim the space they consume, because they are still kept as versions.
  • Large child node lists can affect the size of the TarPM itself quite heavily: Every addition of a new child creates a new copy of the parent node (including a child node list which continuously grows). So besides the performance impact of such large lists there is also an impact on disk consumption.
  • As a recommendation, run the TarOptimizer on a daily basis and the datastore garbage collection on a monthly (or even quarterly) basis. First run the TarOptimizer, then the datastore garbage collection.

CQ 5.5: Changes to the startup

CQ 5.5 has already been out for a month. See here for the major changes brought with this release.

Amongst the hundreds of new features I would like to point out a single one, which is probably the most interesting for admins and operations people.
With CQ 5.5 the quickstart no longer starts a servlet engine with 2 web applications deployed in it (crx.war for the CRX and launchpad.war for the OSGi container including Sling and CQ5 itself). As CRX has now been adapted to work inside an OSGi container, it is possible to drop the artificial differentiation between CRX and the other parts of the system and handle them alike inside Apache Felix. The same applies to the CQSE servlet engine; it’s now a service embedded into OSGi (so the server.log file is gone). So the quickstart starts the Felix OSGi container, which then takes care of starting the repository services, the servlet engine and all other services introduced by Sling and CQ5. This streamlined the startup process a lot. And — finally — there is only 1 place where you need to change the admin password.

Speaking of the start script: it has also changed. Essentially it has been rewritten, and a lot of old cruft has been removed; when I started with Communique 3.5, the basic structure of the serverctl script was the same as in CQ 5.4. It is much easier to understand now, but the out-of-the-box version has a major flaw: it doesn’t redirect stdout and stderr to dedicated logfiles, but lets them print on the console. This is at least annoying, but more often a real problem; for example the thread dumps of a Sun/Oracle JVM are printed to stdout and are lost then. The same applies to the garbage collection logs (if not redirected to a specific file).

So instead of

java $CQ_JVM_OPTS -jar $CQ_JARFILE $START_OPTS &
echo $! > "conf/cq.pid"

use these lines:

(
        (
                java $CQ_JVM_OPTS -jar $CQ_JARFILE $START_OPTS &
                echo $! > "conf/cq.pid"
        ) 2>&1 | /usr/sbin/rotatelogs2 stdout.log 5M
) &

It requires the rotatelogs2 utility (depends a bit on your system; it might also be named “rotatelogs”) shipped with the Apache HTTPD webserver (SuSE Linux: apache2 package), and it ensures that the logs are rotated correctly when their configured size (here: 5 megabytes) is reached. Note that the stderr redirection has to happen before the pipe, otherwise the java process’ stderr would still go to the console.

(No need to report this on Daycare, it’s already submitted.)

CQ5 init script

Nearly all Unix flavors have the notion of so-called init scripts, which are executed at startup (or when you switch the runlevel) and which start all required services or adapt settings. Usually all daemon and server processes are started via such init scripts. CQ5 doesn’t provide such a script out of the box, but rather a complex shell script called “serverctl”, which performs many steps you would expect from an init script:

  • assembling parameters
  • setting environment variables
  • start/stop/status of the server process (start and stop are also available as separate scripts for ease of use)

This would easily qualify the script as an init script, but it lacks one important task: changing the UID of the to-be-started process, so that it doesn’t run as the root user. While from a purely functional perspective this isn’t important, every sysadmin will tell you that you shouldn’t run processes as root when they don’t require such permissions to perform their tasks. And the security guys won’t even let you go live with such a setting. So if you write an init script, you should add this functionality there.

And one last plea: Do not start the java process directly in the init script and assemble the command line yourself, but call the start and stop scripts. Especially if you are not that familiar with the serverctl script. So, a simple init script could look like this:

#!/bin/sh
CQ5_ROOT=/opt/cq5
CQ5_USER=cq5

########
SERVER=${CQ5_ROOT}/crx-quickstart/server
START=${SERVER}/start
STOP=${SERVER}/stop
STATUS="${SERVER}/serverctl status"

case "$1" in
  start)
    su - ${CQ5_USER} -c "${START}"
    ;;
  stop)
    su - ${CQ5_USER} -c "${STOP}"
    ;;
  status)
    su - ${CQ5_USER} -c "${STATUS}"
    ;;
  *)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac

Meta: adaptTo, CQ Blueprints

Last week the adaptTo event happened in Berlin, Germany, a smaller conference related to Apache Sling, Jackrabbit and also CQ5 (or WEM, as it’s called now in Adobish). Sadly I wasn’t able to attend for family reasons, but you can download all slides on the website of pro!vision, the sponsor of that conference. I especially like the presentation by Jukka (Jackrabbit/CRX performance tuning), as he shows many design principles of Jackrabbit which have an influence on performance. And every developer and application designer working with Sling should have a look at the presentation by Achim Schermuly-Koch, as he covers a very interesting security and also performance pattern which affects many sites.

Another interesting site I recently stumbled upon is CQ Blueprints, which claims to collect best practices for CQ 5.2 through CQ 5.4; I browsed through it and found some interesting ideas, but also some tips which are not “best practices” in my opinion. I haven’t found the time yet to write constructive comments there.

A short outlook: I plan to write short posts about “init scripts for CQ5” and “dispatcher invalidation”. The latter topic is rather interesting, as I recently advised several people to bypass the regular invalidation, while in the years before there was never a need to do so. So, stay tuned.

Is my CRX performance I/O bound?

In the last months I’ve seen many situations where the CQ5 performance was poor while the CPU usage was quite low (on a 16-core machine, 15 cores were idling). The next assumption, that I/O is the problem, wasn’t confirmed by tools like top, iostat or sar, because they showed an I/O wait of 3-4%, which indicates that there’s I/O, but the system is not loaded with it.

Further investigation using thread dumps and a profiler showed that there was indeed an I/O problem: they showed a lot of blocked threads within the JVM which wanted to do I/O. So how does this fit together?

Basically it can be explained easily. Your operating system is optimized to handle multiple parallel threads/processes which want to do I/O. If there are too many of them, or the I/O subsystem is too slow, the I/O wait ratio will increase. But in the CRX case it’s a bit different. Because of its internal structures, CRX requires locking to synchronize its write actions to disk (reads are done in parallel). So for many operations (like updating metadata, writing the journal log, etc.) only 1 thread can actually write to the filesystem; all others are waiting for this thread to finish its write action. For the operating system this looks like a single write action, which can be handled quite easily without the I/O wait skyrocketing.
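The effect is easy to reproduce outside of CRX. This toy program (not CRX code, just an illustration) starts 16 writer threads which all funnel through one lock: a thread dump shows 15 threads BLOCKED, while iostat only ever sees a single, relaxed writer:

import java.io.FileOutputStream;
import java.io.IOException;

public class SingleWriterDemo {

  private static final Object writeLock = new Object();

  public static void main(String[] args) {
    for (int i = 0; i < 16; i++) {
      new Thread(new Runnable() {
        public void run() {
          try {
            for (int j = 0; j < 100; j++) {
              // all "repository writes" are serialized here, like in CRX
              synchronized (writeLock) {
                FileOutputStream out = new FileOutputStream("demo.dat", true);
                out.write(new byte[64 * 1024]);
                out.getFD().sync(); // force the data to disk
                out.close();
              }
            }
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
      }).start();
    }
  }
}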

So if you suspect that you have an I/O problem, use a profiler or thread dumps (or /crx/diagnostic/prof.jsp); the data displayed by top doesn’t tell the whole truth.