Understanding the “Oak Repository Statistics” MBean

In the last releases AEM has been greatly enhanced to provide information which are suitable for health detection. Especially Oak provides a huge amout of MBeans which can be monitored. But sometimes they are a bit hard to understand. Based on some ongoing activities I digged through the “Oak Repository statistics” MBean and found it quite useful, even for some basic understanding and analysis.

I did this analysis and the screenshot on AEM 6.5, but this MBean is present at least since AEM 6.1 (probably even 6.0) and its content hasn’t changed much.

When you access this MBean the top of the page looks like this (this instance has just started):

There are a number of values collected, and presented for a number of times:

per second: The raw value in each second for the last minute.
per minute: The aggregated value on a minute basis for the last hour
per hour: The aggregated value on a hourly basis for the last day
per day: the aggregated value on a daily basis

The aggregation differs based on the type of the metric:

Gauge: This is a simple value, which is not further processed. When values of types must be aggregated, an average is calculated.
Counter: This a number, which can be accumulated. When values of type counter must be aggregated, they are summed up.

Attribute Name	Type	Description
SessionCount	Gauge	The number of JCR sessions, which are currently open
SessionLogin	Gauge	The number of sessions opened within that time
SessionReadCount	Counter	The number of read operations in the JCR (over all sessions)
SessionReadDuration	Gauge	The total time spent in read operations (nanoseconds)
SessionReadAverage	Gauge	the average duration for read operations (SessionReadDuration divided by the number of reads)
SessionWriteCount	Counter	The number of write operations in the JCR (over all sessions). Be aware that session.refresh() is also counted as write operation!
SessionWriteDuration	Gauge	Total time spent writing to sessions in nanoseconds
SessionWriteAverage	Gauge	the average duration for write operations (SessionWriteDuration divided by the number of writes)
QueryCount	Counter	The number of JCR queries executed
QueryDuration	Gauge	The total time spent in JCR query operations (miliseconds)
QueryAverage	Gauge	the average duration for queries (QueryDuration divided by the number of writes)
ObservationEventCount	Counter	The number of observation events delivered to all listeners
ObservationEventDuration	Gauge	The total time spent processing events by all observation listeners in nanoseconds
ObservationEventAverage	Gauge	the average duration spent processing observation events (ObservationEventDuration divided by the number of events)
ObservationQueueMaxLength	Gauge	The maximum length of the JCR Observation Queue; in newer Oak versions this queue does no longer exist, and then the value is -1

This measurement is done to limit the amount of data which needs to be stored. And this data is stored within the JVM inside the Oak bundles; that means that any restart of the JVM or restart of the Oak bundles will reset these values. If you want to persist these values you need to read them via JMX and store them.

Ok, what can you do with all this data? Well, it can help you to answer many questions. For example you can find out very easy, if you have a session leak.Because then numbers ion the SessionCount attribute always increase over time. It’s also interesting to find out what is happening within your system when it’s completely idle. Are there repository writes which you are not expecting? Queries every few seconds?

If you are investigating performance issues, or if you want to avoid them, you should have a look into this MBean.

Update: Fixed the only link in this post. Thanks Oswald for reporting!

“We have an urgent performance issue” (part 2)

As a reaction of the last post I got the question by Oswaldo about specific recommendations on performance. Actually, there are a lot. But that’s material for another blog post 🙂 or skip to the bottom of this post.

Instead I want to give you a recommendation on how to handle situations when you did not have time nor capacity you can spend on thinking about performance and response times. But as an experienced technical leader you know that at some point this question will arise for sure. You might get a few hours to spend on that question, but how do you spend it most efficiently?

Clearly not for performance optimization! Because it’s not enough time to analyze and improve substantial parts of the application. And tomorrows changes might render these improvements useless…

Instead I would recommend you to spend this time in communication and building rapport with people who can help you in case such a performance problem arises. Get in contact with the operations people which are operating your system and application. Understand how they work and what tools they use. Understand how they can help you in case of performance issues, what information they can provide to you. Ask for an account on their monitoring system, just to demonstrate interest in their work and problems. And potentially give them some tips what they can additionally do to improve the quality of the information (for example asking if they can also provide the raw data and not only the visualization based on aggregated data). Or show them some hints how they improve their work with your application.

The biggest value in that activity is the fact, that in case the dreaded performance issue is noticed on an exec level, you already know who to talk to. You know a bit how the others are working and how you can help them. As a tech lead it’s then much easier to ask for logfiles, traffic patterns, CPU usage graphs and I/O latencies, threaddumps etc. You know upfront what information it operations already collects by default. You might have direct access to a monitoring system to get more information. You can even get a warning from the ops people in advance that some real big escalation is imminent. For me this is the best you can get if you have just a few hours to spend.

You might ask why is that important. Because it reduces the TTAD (time to actionable data) dramatically in case of such performance issues. You know who to get on the phone and into calls to start investigation. You already know what information is already available or you can even access it directly. You can report “We are analyzing data and can come up with first suggestions within the day” instead of “we are talking to IT and see how they can support us to get data”.

That’s much more important than spending some hours on random performance tuning. And in case you ever run into performance issues, these hours are one of the best investment you made in the whole project.

(And as random recommendation to improve AEM request rendering times: Disable the MobileRedirectFilter (PID: com.day.cq.wcm.mobile.core.impl.redirect.RedirectFilter) by setting the configuration parameter “redirect.enabled” to “false”. In the age of responsive websites it’s purpose is no longer given. And under load its performance impact can be significant.)

4 problems in projects on performance tests

Last week I was in contact with a collague of mine who was interested in my experience on performance tests we do in AEM projects. In the last 12 years I worked with a lot of customers and implemented smaller and larger sites, but all of them had at least one of the following problems.

(1) Lack of time

In all project plans time is allocated for performance tests, and even resources are assigned to it. But due to many unexpected problems in the project there are delays, which are very hard to compensate when the golive or release date is set and already announced to or by senior management. You typically try to bring live the best you can, and steps as performance tests are typically reduced first in time. Sometimes you are able to do performance tests before the golive, and sometimes they are skipped and postponed after golive. But in case when timelines cannot be met the quality assurance and performance tests are typical candidates which are cut down first.

(2) Lack of KPIs

When you the chance to do performance tests, you need KPIs. Just testing random functions of your website is simply not sufficient if you don’t know if this function is used at all. You might test the functionality which the least important one and miss the important ones. If you don’t have KPIs you don’t know if your anticipated load is realistic or not. Are 500 concurrent users good enough or should it rather be 50’000? Or just 50?

(3) Lack of knowledge and tools

Conducting performance tests requires good tooling; starting from the right environment (hopefully comparable sizing to production, comparable code etc) to the right tool (no, you should not use curl or invent your own tool!) and an environment where you execute the load generators. Not to forget proper monitoring for the whole setup. You want to know if you provide enough CPU to your AEM instances, do you? So you should monitor it!.

I have seen projects, where all that was provided, even licenses for the tool Loadrunner (an enterprise grade tool to do performance tests), but in the end the project was not able to use it because noone knew how to define testcases and run them in Loadrunner. We had to fall back to other tooling or the project management dropped performance testing alltogether.

(4) Lack of feedback

You conducted performance tests, you defined KPIs and you were able to execute tests and get results out of them. You went live with it. Great!
But does the application behave as you predicted? Do you have the same good performance results in PROD as in your performance test environment? Having such feedback will help you to refine your performance test, challenging your KPIs and assumptions. Feedback helps you to improve the performance tests and get better confidence in the results.

Conclusion

If you haven’t encountered these issues in your project, you did a great job avoid them. Consider yourself as a performance test professional. Or part of a project addicted to good testing. Or you are so small that you were able to ignore performance tests at all. Or you just deploy small increments, that you can validate the performance of each increment in production and redeploy a fixed version if you get into a problem.

Have you experienced different issues? Please share them in the comments.

TarPM lowlevel write performance analysis

The Tar Persistence Manager (TarPM) is the default persistence mechanism in CQ. It stores all data in a filesystem and is quite good and performant.
But there are situations where you would like what actually happens in the repository:

When your repository grows and grows
When you suffer a huge number of JCR events
When you do performance optimization
When you’re just interested, what happens under the hoods

In such occasions you can use the “CRX Change history” page on the OSGI console; if you choose the “details” link to the most recent tar file, it will show you the details of each transactions: the name of the changed nodes and a timestamp.

In this little screenshot you can see, that first some changes to the image node have been written; immediately afterwards the corresponding audit event has been stored in the repository.

I use this tool especially when I need to check the data which is written to the repository. Especially when multiple operations run concurrently and I want need to monitor the behavior of the some application code I don’t know very well, this screen is of huge help. I can find out, if some process really writes that much data as anticipated. Also the amount of nodes written in a single transaction shows, if batch saves are used or if the developer preferred lots of individual saves (which have a performance penalty).
And you can really check, if your overflowing JCR event queue is caused by many write operations or by slow observation listeners.

So it’s a good tool if you assume that the writes to your repository should be quicker.

Ways to access your content with JCR (part 1)

If you are a developer and need to work with databases, you often relay on the features your framework offers you to get your work done easily. Working directly with JDBC and SQL is not really comfortable, writing “SELECT something FROM table” with lots of constraints can be tedious …

The SQL language offers only the “select” statement to retrieve data from the database. JCR offers multiple ways to actually get access to a node:

Each of these methods serve for different purposes.

session.getNode(path) is used, when you know exactly the path of a node. That’s comparable to a “select * from table where path = “/content/geometrixx/en” in SQL, which is a direct lookup of a well-known node/row.
node.getNodes() returns all child nodes below the node. This method has no equivalent in the SQL world, because in JCR there are not only distinct and independent nodes, but nodes might have a hierarchical relation.
The JCR search is the equivalent of the SQL query, it can return a set of nodes. Yes, ANSI SQL 92 is much more powerful, but let’s ignore that for this article, okay?

In ANSI SQL, approach 1 and 3 are both realized by a SELECT query, while the node.getNodes() approach has no direct equivalent. Of course it can also realized by a SELECT statement (likely resolving a 1:n relation), but it highly depends on the structure of your data.

In Jackrabbit (the reference implementation of the JCR standard and also the basis for CRX) all of these methods are implemented differently.

session.getPath(): It starts at the root node and drills down the tree. So to lookup /content/geometrixx/en the algorithm starts at the root, then looks up the node with the name “content”, then looks for a child node named “geometrixx” and then for a child node named “en”. This approach is quite efficient because each bundle (you can consider it as the implementation equivalent of a JCR node) references both its parent and all the child nodes. On every lookup the ACLs on that node are enforced. So even when a node “en” exists, but the session user does not have read access on it, it is not returned, and the algorithm stops.

node.getNodes is even more efficient, because it just has to lookup the bundles of the child node list and filter it by ACLs.

If you use the JCR search, the Lucene index is used to do the search lookup and the bundle information is used to construct the path information. The performance of this search depends on many factors, like (obviously) the amount of results returned from Lucene itself and the complexity of the other constraints.

So, as a CQ developer, you should be aware of that there are many ways to get hold of your data. And all have different performance behaviors.

In the next posting I will explain this case on a small example.

CRX 2.3: snapshot backup

About a year ago I wrote an improved version of backup for CRX 2.1 and CRX 2.2. The approach is to reduce the amount of data which is considered by the online backup mechanism. With CRX 2.3 this apprach can still be used, but now an even better way is available.

A feature of the online backup — the blocking and unblocking of the repository for write operations — is now available not only to the online backup mechanism, but can be reached via JMX.

So, by this mechanism, you can prevent the repository from updating its disk structures. With this blocking enabled you can backup all the repository and then unblock afterwards.

This allows you to create a backup mechanism like this:

Call the blockRepositoryWrites() method of the “com.adobe.granite (Repository)” Mbean
Do a filesystem snapshot of the volume where the CRX repository is stored.
Call “unblockRepositoryWrites()
Mount the snapshot created in step 2
Run your backup client on the mounted snapshot
Umount and delete the snapshot

And that’s it. Using a filesystem snapshot instead of the online backup accelerates the whole process and the CQ5 application is affected (step 1 – 3) only for a very small timeframe (depending on your system, but should be done in less than a minute).

Some notes:

I recommend snapshots here, because they are much faster than a copy or rsync, but of course you can use these as well.
While the repository writes are blocked, every thread, which wants to do a write operation on the repository, will be blocked, read operations will work. But with every blocked write operation you’ll have one thread less available. So in the end you might run into a case, where no threads are available any more.
Currently the UnlockRepositoryWrites call can be made only by JMX and not by HTTP (to the Felix Console). That should be fixed within the next updates of CQ 5.5. Generally speaking I would recommend to use JMX directly over HTTP-calls via curl and the Felix Console.

Link of the day

Today I found an interesting article on ACM Queue, which explains a lot of theory behind performance tuning.

Performance, like any other feature, doesn’t just happen; it has to be designed and built. To do performance well, you have to think about it, study it, write extra code for it, test it, and support it.

Basic performance tuning: Caching

Many CQ installations I’ve seen start with the default configuration of CQ. This is in fact a good decision, because the default configuration can handle small and middle installations very well. And additionally you don’t have to maintain a bunch of configuration files and settings; and finally most CQ hotfixes (which are delivered without the QA) are only tested with default installations.

So when you start with your project and you have a pristine CQ installation, the performance of both publishing and authoring instances are usually very good, the UI is responsive, page load times in the 2-digit miliseconds. Great. Excellent.

When your site grows, when the content authors start their work, you need to do your first performance and stress tests using numbers provided by the requirements (“the site must be able to handle 10000 concurrent requests per second with a maximal response time of 2 seconds”). You either can overcome such requirements by throwing hardware on the problem (“we must use 6 publishers each on a 4-core machine”) or you just try to optimize your site. Okay, let’s try it with optimization first.

Caching is a thing which comes to mind first. You can cache on several layers of the application, be it application level (caches builtin into the application, like the outputcache of CQ 3 and 4), the dispatcher cache (as described here in this blog), or on the users system (using the browser cache). Each cache layer should decrease the number of requests in the remaining caches, so that in the end only the requests get through, which cannot be handled in a cache, but must be processed in CQ. Our goal is to move the files into a cache which is nearest to the enduser; then loading of these files is faster than if the load is performed from a location which is 20 000 kilometers away.

(A system engineer may also be interested in that solution, because it will offload data traffic from the internet connection. Leaves more capacity for other interesting things …)

If you start from scratch with performance tuning, grasping for the low-hanging fruits is the way to go. So you start into an iterative process, which contains of the following steps:

Identify requests which can be handled by a caching layer which is placed nearer to the enduser.
Identify actions, which allows to cache these requests in a cache next to the user.
Perform these actions
Measure the results using appropriate tools
Start over from (1)

(For a more broader view to performance tuning, see David Nueschelers post on the Day developer site)

As an example I will go through this cycle on the authoring system. I start with a random look at the request.log, which may look like this:

09/Oct/2009:09:08:03 +0200 [8] -> GET /libs/wcm/content/welcome.html HTTP/1.1
09/Oct/2009:09:08:06 +0200 [8] <- 200 text/html; charset=utf-8 3016ms
09/Oct/2009:09:08:12 +0200 [9] -> GET / HTTP/1.1
09/Oct/2009:09:08:12 +0200 [9] <- 302 - 29ms
09/Oct/2009:09:08:12 +0200 [10] -> GET /index.html HTTP/1.1
09/Oct/2009:09:08:12 +0200 [10] <- 302 - 2ms
09/Oct/2009:09:08:12 +0200 [11] -> GET /libs/wcm/content/welcome.html HTTP/1.1
09/Oct/2009:09:08:13 +0200 [11] <- 200 text/html; charset=utf-8 826ms
09/Oct/2009:09:08:13 +0200 [12] -> GET /libs/wcm/welcome/resources/welcome.css HTTP/1.1
09/Oct/2009:09:08:13 +0200 [12] <- 200 text/css 4ms
09/Oct/2009:09:08:13 +0200 [13] -> GET /libs/wcm/welcome/resources/ico_siteadmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [14] -> GET /libs/wcm/welcome/resources/ico_misc.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [15] -> GET /libs/wcm/welcome/resources/ico_useradmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [15] <- 200 image/png 8ms
09/Oct/2009:09:08:13 +0200 [16] -> GET /libs/wcm/welcome/resources/ico_damadmin.png HTTP/1.1
09/Oct/2009:09:08:13 +0200 [16] <- 200 image/png 5ms
09/Oct/2009:09:08:13 +0200 [13] <- 200 image/png 17ms
09/Oct/2009:09:08:13 +0200 [14] <- 200 image/png 17ms
09/Oct/2009:09:08:13 +0200 [17] -> GET /libs/wcm/welcome/resources/welcome_bground.gif HTTP/1.1
09/Oct/2009:09:08:13 +0200 [17] <- 200 image/gif 3ms

Ok, it looks like that some of such requests must not be handled by CQ: the PNG files and the CSS files. These files usually never change (or at least change very seldom, maybe on a deployment or when a hotfix is deployed). But for the usual daily work of an content author they can be assumed to be static, but we must of course provide a way that we enable the authors to fetch a new one, when an update to one them occurs. Ok, that was step 1: We want to cache the PNG and the CSS files which are placed below /libs.

Step 2: How can we cache these files? We don’t want to cache them within CQ (that wouldn’t bring any improvement), so remains dispatcher and browser cache. In this case I recommend to cache them in the browser cache for 2 reasons:

These files are requested more than once during a typical authoring session, so it makes sense to cache directly in the browser cache.
Latency of the browser cache is ways lower than the latency of any load from the network.

As an additional restriction which speaks against the dispatcher:

There are no flusing agents for authoring mode, so we cannot use the dispatcher that easily. So in the case of tuning an authoring instance we cannot use the dispatcher cache.

And to make any changes to these files made on the server visible to the user, we can use the expiration feature of HTTP. This allows us to specify a time-to-live, which basically tells any interested party, how long we consider this file up-to-date. When this time is reached, every party, which cached it, should remove it from cache and refetch.
This isn’t the perfect solution, because a browser will drop the file from its cache and refetch it from time to time, although the file is still valid and up-to-date.
But there’s still an improvement, if the browser fetches this files every hour instead of twice a minute (when a page load occurs).

Our prognose is, that the browser of an authoring user won’t perform that much requests on files anymore; this will increase the rendering performance of the page (the files are fetched from the fast browsercache instead from the server), and additionally the load on the CQ will decrease, because it doesn’t need to handle that much requests. Good for all parties.

Step 3: We implement this feature in the apache webserver, which we have placed in front of our CQ authoring system and add the following statements:

<LocationMatch /libs>
ExpiresByType image/png "access plus 1 hour"
ExpiresByType text/css "access plus 1 hour"
</LocationMatch>

Instead of relying on file extensions we specify here the expiration by the MIME-type in these rules. The files are considered to be up-to-date for an hour, so the browser will reload these files every hour. This value should be ok also in case these files are changed once. And if everything fails, the authoring users can drop their browser cache.

Step 4: We measure the effect of our changes using 2 different strategies: First we observe the request.log again and check if these requests appear further on. If the server is already heavy loaded, we can additionally check for a decreasing load and an improved response times for the remaining requests. As a second option we take a simple use case of an authoring user and run it with Firefox’ Firebug extension enabled. This plugin can visualize how and when the load of the parts of a page happen, and display the response times quite exactly. You should see now, that the number of files requested over the network has decreased and the load of a page and all its emnbedded objects is faster than before.

So with an quick and easy-to-perform action you have decreased the page load times. When I added expiration headers to a number of static images, javascripts and css files on a publishing instance, the number of requests which went over the wire went down to 50%, the pageload times also decreased, so that even during a stress test the site still had a good performance. Of course, dynamic parts must be handled by their respective systems, but if we can offload requests from CQ, we should do this.

So as a conclusion: Some very basic changes to the system (some configuration adjustments to the apache config) may increase the speed of your site (publishing and authoring) dramatically. Such changes as described are not invasive to the system and are highly adjustible to the specific needs and requirements of your application.

Permission sensitive caching

In the last versions of the dispatcher (starting with the 4.0.1 release) Day added a very interesting feature to the dispatcher, which allows one to cache also content on dispatcher level which are not public.

Honwai Wong of the Day support team explained it very well on the TechSummit 2008. I was a bit suprised, but I even found it on slideshare (the first half of the presentation)

Honwai explains the benefits quite well. From my experience you can reduce the load on your CQ publishers (trading a request which requires the rendering of a whole page to to a request, which just checks the ACLs of a page).

If you want to use this feature, you have to make sure that for every group or user, who has to have a individual page, the dispatcher delivers the right one. Imagine you want the present the the logged-in users the latest company news, but not logged-in users shouldn’t get them. And only the managers get the link to the the latest financial data on the startpage. So you need a startpage for 3 different groups (not-logged-in users, logged-in users, managers), and the system should deliver it appropriatly. So having a single home.html isn’t enough, you need to distinguish.

The easiest way (and the Day-way ;-)) is to use a selector denoting the group the user belongs to. So home.group-logged_in.html or home.managers.html would be good. If no selector is given, we assume the user to be an anonymous user. You have to configure the linkchecker to rewrite all links to contain the correct selector. So if a user belongs to the logged_in group and requests the home.logged_in.html page, the dispatcher will ask the CQ ” the user has the following http header lines and is requesting the home.logged_in.html, is it ok?”. CQ then checks if the given http header lines do belong to a user of the group logged_in; because he is, it responses with “200 OK, just go on”. And then the dispatcher will deliver the cached file and there’s no need for the CQ to render the same page again and again. If the users doesn’t belong to that group, CQ will detect that and send a “403 Permission denied”, and the dispatcher forwards this answer then to the user. If a user is member of more than one group, having multiple “group-“selectors is perfectly valid.

Please note: I speak of groups, not of (individual) users. I don’t think that this feature is useful when each user requires a personalized page. The cache-hit ratio is pretty low (especially if you include often-changing content on it, e.g daily news or the content of an RSS feed) and the disk consumption would be huge. If a single page is 20k and you have a version cached for 1000 users, you have a disk usage of 20 MB for a single page! And don’t forget the performance impact of a directory filled up with thousands of files. If you want to personalize pages for users, caching is inappropriate. Of course the usual nasty hacks are applicable, like requesting the user-specific data via an AJAX-call and then modifying the page in the browser using Javascript.

Another note: Currently no documentation is available on the permission sensitive caching. Only the above linked presentation of Honwai Wong.

Creating cachable content using selectors

The major difference between between static object and dynamically created object is that the static ones can be stored in caches; their content they contain does not depend on user or login data, date or other parameters. They look the same on every request. So caching them is a good idea to move the load off the origin system and to accelerate the request-response cycle.

A dynamically created object is influenced by certain parameters (usually username/login, permissions, date/time, but there are countless other) and therefor their content may be different from request to request. These parameters are usually specified as query parameters and must not be cached (see the HTTP 1.1 specification in RFC 2616).

But sometimes it would be great, if we could combine these 2 approaches. For example you want to offer images in 3 resolutions: small (as a preview image e.g in folder view), big (full screen view) and original (the full resolution delivered by the picture-taking device). If you decide to deliver it as static object, it’s cachable. But you need then 3 names (one for each resolution), one for each resolution. Choosing this will blur the fact that these 3 images are the same and differ only in the fact of the image resolution. It creates 3 images instead having only one in 3 instances. Amore practical drawback is that you always have to precompute these 3 pictures and place them on a reachable location. Lazy generation is hard also.

If you choose the dynamic approach, the image would be available as one object for which the instance can be created dynamically. The drawback is here that it cannot be cached.

Day Communique has the feature (the guys of Day ported it also to Apache Sling) to use so-called selectors. They behave like the query parameters one used since the stoneage of the HTTP/HTML era. But they are not query parameters, but merely encoded in the static part of the URL. So the query part of the ULR (as of HTTP 1.1) is no longer needed.

So you can use the URLs /etc/medialib/trafficjam.preview.jpg, /etc/medialib/trafficjam.big.jpg and /etc/medialib/trafficjam.original.jpg to adress the image in the 3 required resolutions. If your dispatcher doesn’t find them in its cache, it will forward the request to your CQ, which can then scale the requested image on demand. Then the dispatcher can store the image and deliver it then from its cache. That’s a very simple and efficient way to make dynamic objects static and offload requests from your application servers.