Is my CRX performance I/O bound?

Over the last few months I’ve seen many situations where CQ5 performance was poor while CPU usage was quite low (on a 16-core machine, 15 cores were idle). The obvious next assumption, that I/O is the problem, wasn’t confirmed by tools like top, iostat or sar: they showed an I/O wait of 3-4%, which indicates that I/O is happening but that the system is not saturated by it.

Further investigation with thread dumps and a profiler showed that there was indeed an I/O problem: they revealed a lot of blocked threads inside the JVM, all of which wanted to do I/O. So how does this fit together?

It can be explained quite easily. Your operating system is optimized to handle many parallel threads and processes that want to do I/O. If there are too many of them, or the I/O subsystem is too slow, the I/O wait ratio increases. The CRX case is a bit different. Because of its internal structures, CRX requires locking to synchronize its write actions to disk (reads are done in parallel). So for many operations (like updating metadata, writing the journal log, etc.) only one thread can actually write to the filesystem; all others wait for this thread to finish its write action. To the operating system this looks like a single writer, which it can handle quite easily without the I/O wait skyrocketing. The waiting threads are blocked on a lock inside the JVM rather than in an I/O call, so they never show up as I/O wait.
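To make this concrete, here is a minimal Java sketch of the single-writer pattern. It is not CRX code: the class SingleWriterDemo, the lock WRITE_LOCK and both methods are hypothetical names made up for this illustration, and the "write" does no real I/O. The main thread plays the role of the one thread that currently holds the write lock, while the other threads queue up behind it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: WRITE_LOCK stands in for whatever lock serializes writes.
// All names here are hypothetical and not taken from CRX.
final class SingleWriterDemo {
    private static final Object WRITE_LOCK = new Object();

    // Each writer needs the lock before it may "write".
    static void simulatedWrite() {
        synchronized (WRITE_LOCK) {
            // A real implementation would write to disk here.
        }
    }

    // Starts the given number of writers while the caller holds the lock
    // and returns how many of them were observed in state BLOCKED.
    static int countBlockedWriters(int writers) {
        List<Thread> threads = new ArrayList<>();
        int blocked = 0;
        try {
            synchronized (WRITE_LOCK) { // the caller is the one active writer
                for (int i = 0; i < writers; i++) {
                    Thread t = new Thread(SingleWriterDemo::simulatedWrite, "writer-" + i);
                    t.start();
                    threads.add(t);
                }
                // Poll for up to five seconds until every writer is blocked.
                long deadline = System.nanoTime() + 5_000_000_000L;
                while (deadline - System.nanoTime() > 0) {
                    blocked = 0;
                    for (Thread t : threads) {
                        if (t.getState() == Thread.State.BLOCKED) {
                            blocked++;
                        }
                    }
                    if (blocked == writers) {
                        break;
                    }
                    Thread.sleep(10); // sleeping does not release the monitor
                }
            }
            // The lock is free again, so the writers can finish one by one.
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("Blocked writers: " + countBlockedWriters(8));
    }
}
```

On a typical run all eight writers end up BLOCKED while the lock is held. None of them is inside an I/O call or burning CPU, so tools like top have nothing to report for them.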

So if you suspect that you have an I/O problem, use a profiler or thread dumps (or /crx/diagnostic/prof.jsp); the data displayed by top doesn’t tell the whole truth.
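If you go the thread dump route, a small helper can do the counting for you. The following is a rough sketch, assuming a dump in the usual HotSpot text format (as produced by jstack or kill -3), where each thread has a "java.lang.Thread.State: ..." line and blocked threads have a "- waiting to lock <0x...>" line. The class DumpSummary and its methods are hypothetical names for this example; other JVMs may format their dumps differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: summarizes a HotSpot-style thread dump read from stdin.
final class DumpSummary {
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");
    private static final Pattern WAITING_TO_LOCK =
            Pattern.compile("- waiting to lock <(0x[0-9a-fA-F]+)>");

    // Counts how often the first capture group of the pattern occurs.
    private static Map<String, Integer> countMatches(Pattern pattern, List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Number of threads per state (RUNNABLE, BLOCKED, WAITING, ...).
    static Map<String, Integer> countStates(List<String> lines) {
        return countMatches(STATE, lines);
    }

    // Number of threads waiting for each lock address.
    static Map<String, Integer> countLockWaiters(List<String> lines) {
        return countMatches(WAITING_TO_LOCK, lines);
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        System.out.println("Thread states: " + countStates(lines));
        System.out.println("Waiters per lock: " + countLockWaiters(lines));
    }
}
```

After compiling, feed it a dump on standard input, for example by piping the output of jstack for the CRX process into java DumpSummary. Many BLOCKED threads that all wait for the same lock address, seen across several dumps taken a few seconds apart, are a strong hint that writes are being serialized.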