Managing repository growth

On almost every project there is this time, when (all of a sudden) the disk space assigned to AEM instances becomes full; in most (!) cases such a situation is detected early on so there is time to react, which often means just adding more disk capacity.

But after that the questions arise: Why is our repository so large? Why does it consume more than the initially estimated disk space? What content exactly is causing this problem? What went wrong so we actually got into that situation?

From my point of view there are 3 different views on this situation, which can be an answer:

Disk space is not managed well, that means that unnecessary stuff is consuming lot of space outside of the repository.
The maintenance jobs are not executed.
The estimation for content and content growth was not realistic.

And in most cases you need to check all these 3 views to answer the question “Why are we running out of space?”

Disk space is not managed well
This is an operations problems in the first place, because non-repository data is basically consuming disk space which has been planned for the repository. Often seen:

Heapdumps and log files are not rotated or removed.
Manually created backups files have just been copied to a folder next to the original one once, but in meanwhile they are not useful any more, because they are totally out of date.
The regular backup process is creating temporary files, which are not cleaned up; or the backup process itself consumes temporary disk space, which lets the disk consumption spike.

So this can just be handled by careful working and in-time purging of old data.

The maintenance jobs are not executed
Maintenance jobs are an essential part of the ongoing job to remove the unnecessary fat from you AEM instance, be it on the content level or on a repo level. It includes

workflow purge
audit log purge
repository compaction (if you use TarMK)
datastore GC (if you use a datastore)

You should always keep an eye on these; the Maintenance Dashboard is a great help here. But do not rely on it blindly!

Your estimation for content and content growth was not realistic
That’s a common problem; you have to give an initial hardware sizing, which also includes the amount of disk space used by the repository. You do your best to include all relevant parameters, you add some buffer on top. But that is an estimation on the beginning of the project, when you don’t know all the requirements and their impact on disk consumption in detail. But that’s what you said, and changing them afterwards is always problematic.

Or AEM is used differently than initially anticipated and all the assumptions you have based your initial hardware sizing are not longer true. Or you just forgot to add the versioning of the assets to your calculation. Or…
There are a lot of cases where in retrospective the initial sizing of the required disk space was just incorrect. In that case you have only chance: Redo the calculation right now! Take your new insights and create a new hardware sizing. And then implement it and add more hardware.

And the only way I see to avoid such situations is: Do not make estimations for the next 3 years! Be a agile and review your current numbers every 3 months; during this review you can also determine the needs for the next months and plan accordingly. Of course this assumes that you are flexible in terms of disk sizing, so for any non-trivial setup the use of SAN as storage technology (plus a volume manager) is my preferred choice!

Of course this does not free yourself from working on the cleanup of the systems and running all required maintenance jobs; but it will make the review of the used disk space a regular task; so you should see deviations from your anticipated plan much earlier.

What is writing to my Oak repository?

If you were ever curious what’s happening on the Oak repository in AEM (and I can tell: a lot!), there’s a chance to logs all the repo write actions.

Just set create a logger for „org.apache.jackrabbit.oak.jcr.operations.write“ on loglevel TRACE and there you go.

21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:userid] 21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:path] 21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:type] 21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:category] 21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/userId] 21.05.2016 23:38:34.354 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/path] 21.05.2016 23:38:34.354 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/type] 21.05.2016 23:38:34.354 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:properties] 21.05.2016 23:38:34.356 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] save

This thread with the name [Thread-8145] writes to the /var/audit path, so it’s quite likely related to the Audit logger. And this is using the session [session-43731]; the session name is random per session, but it is a very useful information:

you can identify a single session and what’s happening within this session. It is especially useful to determine what this specific session is writing and how often there are saves.
If you have the session name, and it is a long-running session, you can look up this session in the JMX MBean console; in Oak 1.0 and Oak 1.2 a stack trace is stored when the session is opened; in Oak 1.4 this has been removed for performance reasons, but you can get it back when you set the System property ‚oak.sessionStats.initStackTraceThreshold‘ to ‚0‘ (zero).

So if you need to understand what’s happening on your repository and what might causing repository write activity, this is an easy way to go. The only drawback: Such logging eats up a lot of diskspace, especially if run it for an extended period of time.

It’s also possible to do this for reads, but at least on Oak 1.0 it doesn’t log the paths, but only the operation; so it’s less useful here. And it produces a lot of data: Having it enabled for 2 minutes and 1 page load on my local instance it produced 10 megabytes of logs.

Disabling services and components in AEM

Sometimes you need to disable a service or a component; a simple example for this a servlet, which is used on authoring instance, but which must not be active on publish. There are several ways to achieve this. (In this blog post whenever I mention “service”, you can implicitly assume that it also works for SCR components; technically even “component” would be right wording, but in the AEM world “component” is heavily used word with a number of different meanings.)

A very simple and smart solution for your own codebase is the use of the SCR configuration policy; when this is used on a OSGI service the SCR runtime won’t start the service if no dedicated OSGI configuration exists (even the activate() method isn’t called). And because you can create OSGI configs based on run mode it’s the perfect way to enable or disable services.

Nice examples for this can be found in ACS AEM Commons:

This the the recommended way to write services; the decision to run or not run it is done on deployment/configuration time and not during development. And it’s the most complete way, because with proper configuration the service will never get active at all. The only (very) small drawback is that your project-managed OSIG configurations will increase.

A different way is to use a special property „enabled“, which is then checked in the service before doing something useful. But when you use the enabled-property, the service is properly started and registered to the OSGI runtime; thus it might get registered as servlet and into other service factories. You never know what is happening or what not, so you it’s always best to have the code ready.
This approach gives you also the choice on deployment time to enable or not to enable the service. But it has the drawback, that the service is active and code of it might run before checking the „enabled“ status. So from my understanding there is never really a usecase for this “enabled” property. An if it has a different function than turning the service on or off, it shouldn’t be named “enable”.

If you need to disable services, which are not under your control, and which neither offer a „enabled“ property nor the configurationPolicy approach, the only remaining choice is the ComponentDisabler of ACS AEM Commons. That’s basically a hack and should be your last resort, because it cannot prevent the startup of the service, but in fact shuts it down after the service has been started (and might have already been working). But if you can live with this constraints, it’s the way to go.

If you are a developer, I strongly recommend to learn and use the SCR ConfigPolicy setting!