Sling Model Performance

In my daily job as an SRE for AEM as a Cloud Service I often have to deal with performance questions, especially in the context of migrations of customer applications. Applications sometimes perform differently on AEM CS than they did on AEM 6.x, and a part of my job is to look into these cases.

This often leads to interesting deep dives and learnings; you might have seen this reflected in the postings of this blog 🙂 The problem this time was a tight loop like this:

for (Resource child : resource.getChildren()) {
    SlingModel model = child.adaptTo(SlingModel.class);
    if (model != null && model.hasSomeCondition()) {
        // some very lightweight work
    }
}

This code performed well with 1000 child resources on an AEM 6.x authoring instance, but quite poorly on an AEM CS authoring instance with the same number of child nodes. And the problem is not the large number of child nodes …

After wading knee-deep through TRACE logs I found the problem at an unexpected location. But before I present the solution and some recommendations, let me explain some background. Of course you can also skip the next section and jump directly to the TL;DR at the bottom of this article.

SlingModels and parameter injection

One of the beauties of Sling Models is that they are simple POJOs, and properties are injected by the Sling Models framework. You just have to add matching annotations to mark the fields accordingly. See the full story in the official documentation.

The simple example in the documentation looks like this:

@Inject
String title;

which (typically) injects the property named "title" from the resource this model was adapted from. The same way you can inject services, child nodes and many other useful things.

To make this work, the framework uses an ordered list of injectors, which are able to retrieve the values to be injected (see the list of available injectors). The first injector which returns a non-null value wins, and its result is injected. In this example the ValueMapInjector, which sits quite early in that list, is supposed to return the property called "title" from the ValueMap of the resource.

Ok, now let’s understand what the system does here:

@Inject
@Optional
String doesNotExist;

Here an optional field is declared, and if there is no property called "doesNotExist" in the ValueMap of the resource, the other injectors are queried whether they can handle that injection. Assuming that no injector can, the value of the field "doesNotExist" remains null. No problem at first sight.

But indeed there is a problem, and it's performance. To demonstrate it, I wrote a small benchmark (source code on my GitHub account), which does a lot of adaptions to Sling Models. When deployed to AEM 6.5.5 or later (or a recent version of the AEM CS SDK) you can run it via curl -u admin:admin http://localhost:4502/bin/slingmodelcompare

This is its output:

de.joerghoh.cqdump.performance.core.models.ModelWith3Injects: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith3ValueMaps: single adaption took 16 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalValueMap: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalValueMaps: single adaption took 20 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalInject: single adaption took 83 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalInjects: single adaption took 137 microseconds

The benchmark runs adaptions to a number of model classes, which differ only in the type of their annotations, against a very simple list of resources. Adapting to a model which injects 3 properties takes approximately 20 microseconds, but as soon as a model contains a failing injection (declared with "@Optional" to avoid failing the whole adaption), the duration increases massively to 83 microseconds, and even to 137 microseconds with 2 such failing injections.

Ok, a few of such failed injections are not a problem per se (you could do 2'000 within 100 milliseconds), but this test setup is a bit artificial, which makes these 2'000 a really optimistic number:

  • It is running on a system with a fast repository (the SDK on my M1 MacBook); so, for example, the ChildResourceInjector has almost no overhead when testing for the presence of a child resource called "doesNotExist". This can be different in other setups: on an AEM CS author instance the Mongo storage has a higher latency than the SegmentStore on the SDK or on a publish. If that (non-existing) child resource is not in the cache, there is an additional latency in the range of 1 ms to load that information. What for? Well, basically for nothing.
  • The OsgiInjector is queried as well, which tries to access the OSGI service registry; this registry is a central piece of OSGI, and its consistency is heavily guarded by locks. I have seen this injector being blocked by these locks, which also adds latency.

That means that these 50-60 microseconds of overhead can easily multiply, and then performance becomes a real problem. And this is exactly what sparked this investigation in the first place.

So what can we do to avoid this situation? That is quite easy: do not use @Inject, but use the annotations of the specialized injectors directly (see them in the documentation). While the benefit is probably quite small for properties which are present (ModelWith3Injects took 18 microseconds vs 16 microseconds for ModelWith3ValueMaps), the difference gets dramatic as soon as we consider failed injections.

Even in my local benchmark the improvement is easy to see: there is almost no overhead for such a failed injection if I explicitly bind it to the ValueMapInjector. And as mentioned, this overhead can be even larger in reality.

Still, in the majority of all cases this is a micro-optimization; but as mentioned already, many such micro-optimizations taken together can definitely make a difference.

TL;DR Use injector-specific annotations

Instead of @Inject, directly use the annotation of the injector you want. You normally know exactly where that injected value should come from.
And by the way: did you know that the use of @Inject is discouraged in favor of these injector-specific annotations?
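To illustrate, here is a minimal sketch of such a model class (class and field names are made up; the annotations come from the Sling Models API):

import javax.inject.Inject;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.Optional;
import org.apache.sling.models.annotations.injectorspecific.InjectionStrategy;
import org.apache.sling.models.annotations.injectorspecific.ValueMapValue;

@Model(adaptables = Resource.class)
public class MyModel {

    // slow when the property is missing: every injector in the chain is queried
    @Inject
    @Optional
    private String legacyTitle;

    // fast even when the property is missing: only the ValueMapInjector is consulted
    @ValueMapValue(injectionStrategy = InjectionStrategy.OPTIONAL)
    private String title;
}

The @ValueMapValue annotation restricts the lookup to the ValueMapInjector, so a missing property fails fast instead of being offered to the whole injector chain.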

(Note to myself: The Sling Models documentation needs an update, especially the examples.)

How to deal with the “TooManyCallsException”

From time to time I see the question "We get the TooManyCallsException while rendering pages, and we need to increase the threshold for the number of inclusions to 5000. Is this a problem? What can we do so we don't run into this issue at all?"

Before I answer this question, I want to explain the background of this setting, why it was introduced and when such a “Call” is made.

Sling rendering is based on servlets; and while a single servlet could handle the rendering of the complete response body, that is not common in AEM. AEM pages normally consist of a variety of different components, which internally can consist of distinct subcomponents as well. This depends on the design approach the development team has chosen.
(It should be mentioned that all JSPs and all HTL scripts are compiled into regular Java servlets.)

That means that the rendering process can be considered a tree of servlets, with servlets calling other servlets (and the DefaultGetServlet being the root of such a tree when rendering pages). This tree is structured along the resource tree of the page, but it can include servlets which render content from different areas of the repository; for example when dealing with content fragments or when including images, whose metadata needs to be respected.

It is possible to turn this tree into a cyclic graph, and then the process of traversing this tree of servlets turns into an endless recursion. In that case request processing never terminates, the Jetty thread pool quickly fills up to its limit, and the system becomes unavailable. To avoid this situation, only a limited number of servlet calls per request is allowed. And that's the magic number of 1000 allowed calls (which is configured in the Sling Main Servlet).
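If you do need to raise the limit, it is an OSGI configuration of the Sling Main Servlet, i.e. a configuration for the PID org.apache.sling.engine.impl.SlingMainServlet. A minimal sketch in the Felix .config notation (I quote the property name from memory, so please verify it in the OSGI console of your instance):

sling.max.calls=I"5000"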

Knowing this, let me try to answer the question "Is it safe to increase this value of 1000 to 5000?". Yes, it is safe in the sense that nothing changes immediately; but in case your page rendering process ever goes recursive, it terminates later, which slightly increases the risk of your AEM instance becoming unavailable.

"Are there any drawbacks? Why is the default 1000 and not 5000 (or 10000 or any higher value)?" From experience, 1000 is sufficient for the majority of applications. It might be too low for applications whose components are designed in a very granular way, which in turn requires a lot of servlet calls to properly render a page.
And every servlet call comes with a small overhead (mostly for running the component-level filters); even if this overhead is just 100 microseconds, 1000 invocations add up to 100 ms just for the invocation overhead. That means you should find a good balance between clean application modularization and its runtime performance overhead.

Which leads to the next question: “What are the problematic calls we should think of?“. Good one.
From a high-level view of AEM page rendering, you cannot avoid the servlet calls which render the components. That means that you as an AEM application developer cannot influence the overall page rendering process; you can only try to optimize the rendering of individual (custom) components.
To optimize these, you should be aware that the following things trigger the invocation of a servlet during page rendering (see the small HTL sketch after this list):

  • the <cq:include>, <sling:include> and <sling:forward> JSP tags
  • the data-sly-include statement of HTL
  • and every method which directly or indirectly invokes the service() method of a servlet.
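For illustration, a minimal HTL sketch (the resource type and script name are made up); each of the two statements dispatches to another servlet during rendering and therefore counts against the limit (data-sly-resource includes behave like data-sly-include in this regard):

<!--/* component.html of a hypothetical component */-->
<div data-sly-include="body.html"></div>
<div data-sly-resource="${'title' @ resourceType='myproject/components/title'}"></div>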

A good way to check this for some pages is the “Recent requests” functionality of the OSGI Webconsole.

AEM micro-optimization (part 4) – define allowed templates

This time I want to discuss a different type of micro-optimization. It's not something you as a developer can implement in your code; it's rather a question of application design, which has some surprising impact. I came across it when I recently investigated poor performance in the Siteadmin navigation. And although I did this investigation on AEM as a Cloud Service, the logic on AEM 6.5 behaves the same way.

When you click through your pages in the Siteadmin navigation, AEM collects a lot of information about pages and folders to display them in the proper context. For example, when you click on a page with child pages, it collects information about which actions should be displayed when a specific child node is selected (copy, paste, publish, …).

An important piece of information is whether the "Create page" action should be made available. And that's the thing I want to outline in this article.

Screenshot: “Create” dialog

Assuming that you have the required write permissions on that folder, the most important check is whether any templates are allowed to be used for a new child page of the current page. The logic is described in the documentation and is quite complex.

In short:

  • On the content side the template must be allowed, using the cq:allowedTemplates property (if present), AND
  • The template must be allowed to be used as a child page of the current page

Both conditions must be met for a template to be eligible as the source of a new page. To display the entry "Page" it's sufficient if at least 1 template is allowed.

Now let's think about the runtime performance of this check, which is mostly determined by the total number of templates in the system. AEM determines all templates with this JCR query:

//jcr:content/element(*,cq:Template)

That query returns 92 results on my local SDK instance with WKND installed. If we look a bit more closely at the results, we can distinguish 3 different types of templates:

  • Static templates
  • Editable templates
  • Content Fragment models

So depending on your use case it's easy to end up with hundreds of templates, and not all of them are applicable at the location you are currently in. In fact, typically just a very few templates can be used to create a page there. That means that the check most likely needs to iterate a lot until it eventually encounters a template which is a match.

Let's come back to the evaluation of whether that entry should be displayed. If you have defined the cq:allowedTemplates property on the page or one of its ancestors, it's sufficient to check the templates listed there. Typically that's just a handful of templates, and it's very likely that you find a "hit" early on, which immediately terminates the check with a positive result. I want to explicitly mention that not every template listed there can actually be used here, because there are also other constraints (e.g. the parent template must be of a certain type, etc.) which must match.

If template A is allowed to be used below /content/wknd/en, then we just need to check this single template A to get that hit. We don't care where it sits in the list of templates returned by the above query, because we know exactly which one(s) to look at.

If that property is not present, AEM needs to go through all templates and check the conditions for each and every one until it finds a positive result. The iteration order is simply the order in which the templates are returned by the JCR query, which means it is not deterministic. It is also not possible to order the result in a helpful way, because the semantics of our check (which includes regular expressions) cannot be expressed as part of the JCR query.

So you are very lucky if the JCR query returns a matching template already at position 1 of the list, but that’s very unlikely. Typically you need to iterate tens of templates to get a hit.

So, what's the impact of this iteration and these checks on performance? In a synthetic test with 200 templates and no match at all, it took around 3-5 ms to iterate over and check all of the results.

You might say "I really don't feel a 3-5 ms delay", but when the list view in Siteadmin performs this check for up to 40 pages in a single request, it rather becomes a 120-200 millisecond difference. And that is a significant delay for requests where bad performance is visible immediately. Especially if there's a simple way to mitigate it.

And for that reason I recommend providing "cq:allowedTemplates" properties in your content structure. In many cases this is possible, and it will speed up the Siteadmin navigation.
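A sketch of what this can look like in a content package (the .content.xml of the site root's jcr:content node; the template paths are made up for illustration):

<jcr:root xmlns:cq="http://www.day.com/jcr/cq/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0"
    jcr:primaryType="cq:PageContent"
    cq:allowedTemplates="[/conf/wknd/settings/wcm/templates/article-page,/conf/wknd/settings/wcm/templates/content-page]"/>

With this in place, the check only needs to look at these two templates instead of iterating over all templates in the system.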

And for those who cannot change that: I am currently working on changing the logic to speed up the processing for the cases where no cq:allowedTemplates property is applicable. And if you are on AEM as a Cloud Service, you'll get this improvement automatically.

AEM micro-optimization (part 1)

As a follow-up to the previous article I want to show you what such a micro-optimization can look like. My colleague Miroslav Smiljanic found that there is a significant difference in the time it takes to execute statements (1) and (2) below.

Node node = …  // any node which is not the root node
Session session = node.getSession();
String parentPath = node.getParent().getPath();

Node p1 = node.getParent();            // (1)
Node p2 = session.getNode(parentPath); // (2)

// JCR nodes must be compared with isSame(), as Node does not override equals()
assertTrue(p1.isSame(p2));

He did the whole writeup in the context of a suggested improvement in Sling, and proved it with impressive numbers.

Is this change important? Just by itself it is not, because walking the resource/node tree upwards is not that common compared to walking down the tree. So replacing a single call might only yield an improvement of a fraction of a millisecond, even if case (2) is up to 200 times faster than (1)!

But if we replace this pattern in all places where getParent() is used with the more performant session.getNode(parentPath) variant, especially in the low-level areas of AEM and Sling, all areas might benefit from it. And then we don't execute it only once per page rendering, but maybe a hundred times. And then we might already end up with tens of milliseconds of improvement, for every request!

And in special use cases the effect might be even higher (for example if your code constantly traverses the tree upwards).

Another example of such a micro-optimization, which is normally quite insignificant but can yield huge benefits in special cases, can be found in SLING-10269, where I found that built-in caching of the isResourceType() results reduces the rendering times of some special requests by 50%, because the check is performed thousands of times.

Typically micro-optimizations have these properties:

  • In the general case the improvement is barely visible (< 1% improvement of performance)
  • In edge cases they can be a life saver, because they reduce execution time by a much larger percentage.

These improvements accumulate over time, and that's where it gets interesting. When you have implemented 10 of these in low-level routines, chances are high that your use case benefits from them as well. Maybe by 10 times 0.5% performance improvement, but maybe also by 20%, because you hit the sweet spot of one of them.

So it is definitely worth paying attention to these improvements.

My recommendation for you: read the entry in the Oak "Do's and Don'ts" page and try to implement this learning in your codebase. And if you find more such cases in the Sling codebase, the community appreciates a ticket.


The effect of micro-optimizations

Optimizing software for speed is a delicate topic. Often you hear the saying “Make it work, make it right, make it fast”, implying performance optimization should be the last step you should do when you code. Which is true to a very large extent.

But in many cases you are happy if your budget allows you to get to the "make it right" phase, and you rarely get the chance to kick off a decent performance optimization phase. That's the situation in many areas of the software industry, and performance optimization is often only done when absolutely necessary. Which is unfortunate, because it leaves us with a lot of software which has performance problems. And in many cases a large part of the problem could be avoided if only a few optimizations were done (at the right spot, of course).

But all this talk of a "performance improvement phase" assumes that it requires huge efforts to make software more performant. In general that is true, but there are typically also a number of actions which can be implemented quite easily and which can be beneficial. Of course these rarely boost your overall application performance by 50%; most often they just speed up certain operations. But depending on the frequency with which these operations are called, they can sum up to a substantial improvement.

I once did a performance tuning session on an AEM publish instance to improve the raw page rendering performance of an application. The goal was to squeeze more page responses out of the given hardware. Using a performance test and a profiler I found that the creation of JCR sessions and Sling ResourceResolvers took 1-2 milliseconds, which was worth investigating. Armed with this knowledge I combed through the codebase, reviewed all cases where a new session is created, and removed all those which were not necessary. This was really a micro-optimization, because I focussed on tiny pieces of the code (not even the areas which are called many times), and the regular page rendering (on a developer machine) did not improve at all. But in production this optimization turned out to help a lot, because it allowed us to deliver 20% more pages per second out of the publish at peak.

In this case I spent quite some time to come to the conclusion that opening sessions can be expensive under load. But now I know that, and I spread this knowledge via code reviews and blog posts.

Most often you don't see the negative effect of these anti-patterns (unless you overdo it and every Sling Model opens a new ResourceResolver), and therefore the positive effects of applying these micro-optimizations are not immediately visible either. But in the end, applying 10 micro-optimizations with a ~1% speedup each sums up to a pretty nice number.

And of course: If you can apply such a micro-optimization in a codepath which is heavily used, the effects can be even larger!

So my recommendation to you: If you come across such a piece of code, optimize it. Even if you cannot quantify and measure the immediate performance benefit of it, do it.

Same as:

for (int i = 0; i <= 100; i++) {
  othernumber += i;
}

I cannot quantify the improvement, but I know that

othernumber += 5050;

is faster than the loop, no questions asked. (Although that’s a bad example, because hopefully the compiler would do it for me.)

In the upcoming blog posts I want to show you a few cases of such micro-optimizations in AEM, which I personally used with good success. Stay tuned.


Slow deployments on AEM 6.4/6.5

A recent post on the AEM forums challenged me to look into an issue I observed myself but did not investigate further.

The observation is that during deployments maintenance tasks are stopped and started a lot, and this triggers a lot of other activity, including a lot of health check executions. This slows down the deployment and also pollutes the logfiles during deployments.

The problem is that the AEM maintenance TaskScheduler is supposed to react to changes of some paths in the repository (where the configuration is stored), but unfortunately it also reacts to any change of ResourceProviders (and every servlet is registered as a single ResourceProvider). And because each such change causes a complete reload/restart of the maintenance tasks (and of some health checks as well), it causes quite some delay.

But this behaviour is controlled via OSGI properties, which are missing by default, so we can add them on our own 🙂

Just create an OSGI configuration for com.adobe.granite.maintenance.impl.TaskScheduler and add a single multi-value property named "resource.change.types" with the values "ADDED", "CHANGED" and "REMOVED".
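A sketch of that configuration in the .cfg.json format (file name com.adobe.granite.maintenance.impl.TaskScheduler.cfg.json; on AEM 6.x the equivalent .config format works as well):

{
  "resource.change.types": ["ADDED", "CHANGED", "REMOVED"]
}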

(Please also report this behavior via Daycare, referring to GRANITE-29609, so we hopefully get a fix for it instead of having to apply this workaround.)

Understanding the “Oak Repository Statistics” MBean

In the last releases AEM has been greatly enhanced to provide information which is suitable for health detection. Especially Oak provides a huge amount of MBeans which can be monitored. But sometimes they are a bit hard to understand. Based on some ongoing activities I dug through the "Oak Repository Statistics" MBean and found it quite useful, even for some basic understanding and analysis.

I did this analysis and the screenshot on AEM 6.5, but this MBean is present at least since AEM 6.1 (probably even 6.0) and its content hasn’t changed much.

When you access this MBean the top of the page looks like this (this instance has just started):

Oak Repository Statistics MBean

There are a number of values collected and presented over a number of time windows:

  • per second: the raw value in each second for the last minute
  • per minute: the aggregated value on a minute basis for the last hour
  • per hour: the aggregated value on an hourly basis for the last day
  • per day: the aggregated value on a daily basis

The aggregation differs based on the type of the metric:

  • Gauge: This is a simple value which is not further processed. When values of this type must be aggregated, an average is calculated (the per-minute value is the average of the 60 per-second values).
  • Counter: This is a number which can be accumulated. When values of this type must be aggregated, they are summed up (the per-minute value is the sum of the 60 per-second values).

The following attributes are available:

  • SessionCount (Gauge): the number of JCR sessions which are currently open
  • SessionLogin (Gauge): the number of sessions opened within that time
  • SessionReadCount (Counter): the number of read operations in the JCR (over all sessions)
  • SessionReadDuration (Gauge): the total time spent in read operations (nanoseconds)
  • SessionReadAverage (Gauge): the average duration of read operations (SessionReadDuration divided by the number of reads)
  • SessionWriteCount (Counter): the number of write operations in the JCR (over all sessions); be aware that session.refresh() is also counted as a write operation!
  • SessionWriteDuration (Gauge): the total time spent in write operations (nanoseconds)
  • SessionWriteAverage (Gauge): the average duration of write operations (SessionWriteDuration divided by the number of writes)
  • QueryCount (Counter): the number of JCR queries executed
  • QueryDuration (Gauge): the total time spent in JCR query operations (milliseconds)
  • QueryAverage (Gauge): the average duration of queries (QueryDuration divided by the number of queries)
  • ObservationEventCount (Counter): the number of observation events delivered to all listeners
  • ObservationEventDuration (Gauge): the total time spent processing events by all observation listeners (nanoseconds)
  • ObservationEventAverage (Gauge): the average duration spent processing observation events (ObservationEventDuration divided by the number of events)
  • ObservationQueueMaxLength (Gauge): the maximum length of the JCR observation queue; in newer Oak versions this queue no longer exists, and then the value is -1

This aggregation is done to limit the amount of data which needs to be stored. The data is held within the JVM inside the Oak bundles; that means that any restart of the JVM or of the Oak bundles will reset these values. If you want to persist them, you need to read them via JMX and store them elsewhere.
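A small sketch of how such a readout could look from within the same JVM (the ObjectName pattern is an assumption based on my local instance; verify it in the JMX view of the OSGI console):

import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class OakStatsReader {

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // assumption: the MBean is registered in the Oak domain with type=RepositoryStats
        Set<ObjectName> names = server.queryNames(
                new ObjectName("org.apache.jackrabbit.oak:type=RepositoryStats,*"), null);
        for (ObjectName name : names) {
            // each attribute (e.g. SessionCount) holds the time series shown in the MBean view
            Object sessionCount = server.getAttribute(name, "SessionCount");
            System.out.println(name + " SessionCount=" + sessionCount);
        }
    }
}

For a remote instance you would connect via a JMXConnector instead of the platform MBean server.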

Ok, what can you do with all this data? Well, it can help you to answer many questions. For example you can find out very easily whether you have a session leak, because then the numbers in the SessionCount attribute keep increasing over time. It's also interesting to find out what is happening within your system when it's completely idle. Are there repository writes which you are not expecting? Queries every few seconds?

If you are investigating performance issues, or if you want to avoid them, you should have a look into this MBean.

Update: Fixed the only link in this post. Thanks Oswald for reporting!

“We have an urgent performance issue” (part 2)

As a reaction to the last post I got a question from Oswaldo about specific recommendations on performance. Actually, there are a lot. But that's material for another blog post 🙂 or you can skip to the bottom of this post.

Instead I want to give you a recommendation on how to handle situations when you had neither the time nor the capacity to think about performance and response times. But as an experienced technical leader you know that at some point this question will arise for sure. You might get a few hours to spend on it, but how do you spend them most efficiently?

Clearly not on performance optimization! Because it's not enough time to analyze and improve substantial parts of the application. And tomorrow's changes might render these improvements useless…

Instead I would recommend you spend this time on communication and on building rapport with the people who can help you when such a performance problem arises. Get in contact with the operations people who run your system and application. Understand how they work and what tools they use. Understand how they can help you in case of performance issues, and what information they can provide to you. Ask for an account on their monitoring system, just to demonstrate interest in their work and problems. And potentially give them some tips on what they can additionally do to improve the quality of the information (for example, ask if they can also provide the raw data and not only visualizations based on aggregated data). Or show them some hints on how they can improve their work with your application.

The biggest value of that activity is the fact that, in case the dreaded performance issue is noticed at exec level, you already know who to talk to. You know a bit about how the others work and how you can help them. As a tech lead it's then much easier to ask for logfiles, traffic patterns, CPU usage graphs, I/O latencies, threaddumps etc. You know upfront what information operations already collects by default. You might have direct access to a monitoring system to get more information. You might even get a warning from the ops people in advance that some real big escalation is imminent. For me this is the best you can get if you have just a few hours to spend.

You might ask why that is important: because it dramatically reduces the TTAD (time to actionable data) in case of such performance issues. You know who to get on the phone and into calls to start the investigation. You already know what information is available, or you can even access it directly. You can report "We are analyzing data and can come up with first suggestions within the day" instead of "We are talking to IT to see how they can support us in getting data".

That's much more important than spending some hours on random performance tuning. And in case you ever run into performance issues, these hours are one of the best investments you made in the whole project.

(And as a random recommendation to improve AEM request rendering times: disable the MobileRedirectFilter (PID: com.day.cq.wcm.mobile.core.impl.redirect.RedirectFilter) by setting the configuration parameter "redirect.enabled" to "false". In the age of responsive websites its purpose is no longer given, and under load its performance impact can be significant.)
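A sketch of that configuration in the .cfg.json format (file name com.day.cq.wcm.mobile.core.impl.redirect.RedirectFilter.cfg.json):

{
  "redirect.enabled": false
}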

4 problems in projects on performance tests

Last week I was in contact with a colleague of mine who was interested in my experience with the performance tests we do in AEM projects. In the last 12 years I worked with a lot of customers and implemented smaller and larger sites, but all of them had at least one of the following problems.

(1) Lack of time

In all project plans time is allocated for performance tests, and resources are even assigned to it. But due to the many unexpected problems in a project there are delays, which are very hard to compensate when the go-live or release date is set and already announced to (or by) senior management. You typically try to go live with the best you have, and when timelines cannot be met, quality assurance and performance tests are the typical candidates which are cut down first. So sometimes you are able to do performance tests before the go-live, and sometimes they are skipped and postponed until after the go-live.

(2) Lack of KPIs

When you have the chance to do performance tests, you need KPIs. Just testing random functions of your website is simply not sufficient if you don't know whether these functions are used at all. You might test the least important functionality and miss the important ones. If you don't have KPIs you don't know whether your anticipated load is realistic or not. Are 500 concurrent users good enough, or should it rather be 50'000? Or just 50?

(3) Lack of knowledge and tools

Conducting performance tests requires good tooling: starting from the right environment (hopefully with sizing comparable to production, comparable code, etc.) via the right tool (no, you should not use curl or invent your own tool!) to an environment where you execute the load generators. Not to forget proper monitoring for the whole setup. You want to know whether you provide enough CPU to your AEM instances, don't you? So you should monitor it!

I have seen projects where all of that was provided, even licenses for LoadRunner (an enterprise-grade performance testing tool), but in the end the project was not able to use it because no one knew how to define test cases and run them in LoadRunner. We had to fall back to other tooling, or project management dropped performance testing altogether.

(4) Lack of feedback

You conducted performance tests, you defined KPIs, and you were able to execute the tests and get results out of them. You went live with it. Great!
But does the application behave as you predicted? Do you see the same good performance results in PROD as in your performance test environment? Having such feedback will help you to refine your performance tests, challenging your KPIs and assumptions. Feedback helps you to improve the performance tests and gain better confidence in the results.

Conclusion

If you haven't encountered these issues in your projects, you did a great job avoiding them. Consider yourself a performance test professional. Or part of a project addicted to good testing. Or your project is so small that you can ignore performance tests altogether. Or you deploy increments so small that you can validate the performance of each one in production and redeploy a fixed version if you run into a problem.

Have you experienced different issues? Please share them in the comments.

TarPM lowlevel write performance analysis

The Tar Persistence Manager (TarPM) is the default persistence mechanism in CQ. It stores all data in a filesystem and is quite good and performant.
But there are situations where you would like to know what actually happens in the repository:

  • When your repository grows and grows
  • When you suffer a huge number of JCR events
  • When you do performance optimization
  • When you're just interested in what happens under the hood

On such occasions you can use the "CRX Change History" page on the OSGI console; if you choose the "details" link of the most recent tar file, it will show you the details of each transaction: the names of the changed nodes and a timestamp.

CRX Change History preview

In this little screenshot you can see that first some changes to the image node have been written; immediately afterwards the corresponding audit event has been stored in the repository.

I use this tool especially when I need to check the data which is written to the repository. Especially when multiple operations run concurrently and I need to monitor the behavior of some application code I don't know very well, this screen is of huge help. I can find out if some process really writes as much data as anticipated. Also the number of nodes written in a single transaction shows whether batch saves are used or whether the developer preferred lots of individual saves (which come with a performance penalty).
And you can really check whether your overflowing JCR event queue is caused by many write operations or by slow observation listeners.

So it's a good tool whenever you suspect that the writes to your repository should be quicker than they are.