OSGi DS & Metatype & SCR properties

When I wrote the last blog post on migrating to OSGi annotations, I already mentioned that the property annotations you used with the SCR annotations cannot be migrated 1:1; instead you have to decide whether to add them to the OCD or as properties to the @Component annotation.

I was reminded of that when I worked on fixing the process label properties for the workflow steps contained in ACS AEM Commons (PR 1645). And the more I think about it, the more I get the impression that this might be causing some confusion in the adoption of the OSGi annotations.

Excursion to the OSGi specification

First of all, in OSGi there are 2 specifications which are important in this case: Declarative Services (DS, chapter 112 of the OSGi R6 enterprise specification, sometimes also referred to as the Service Component Model) and Metatype Services (chapter 105 of the OSGi R6 enterprise specification). See the OSGi website for downloads (unfortunately there is no HTML version available for R6, only PDFs).

Declarative Services deals with services and components, their relations, and the required machinery around them. Quoting from the spec (112.1):

The service component model uses a declarative model for publishing, finding and binding to OSGi services. This model simplifies the task of authoring OSGi services by performing the work of registering the service and handling service dependencies. This minimizes the amount of code a programmer has to write; it also allows service components to be loaded only when they are needed.

Metatype Services care about the configuration of services. Quoting from chapter 105.1:

The Metatype specification defines interfaces that allow bundle developers to describe attribute types in a computer readable form using so-called metadata. The purpose of this specification is to allow services to specify the type information of data that they can use as arguments. The data is based on attributes, which are key/value pairs like properties.

Ok, how does this relate to the @Property annotation of the Felix SCR annotations? I would say that this annotation cannot be clearly attributed to either DS or Metatype; it served both.

You could add a @Properties annotation to the class, or an individual @Property annotation to a field. You could add “metatype=true” to the annotation, and then the property appeared in the OSGi web console (it then was a “real” Metatype property in the sense of the Metatype specification).

But either way, all the properties were provided through the ComponentContext.getProperties() method; that means in practice it never really made a difference how you defined a property, whether at the class level or on a field, and whether you added metatype=true or not. That was nice and in most cases also very convenient.

This changed with the OSGi annotations, because now the properties are described in a type annotated with @ObjectClassDefinition: type-safe and named. But here it's clearly a Metatype thing (it's configuration), and it cannot serve Declarative Services (the services and components thing) in parallel. Now you have to make a decision: is it a configuration item (something I use in the code) or is it a property which influences the component itself?

As an example, with SCR annotations you could write

@Component
@Service
@Properties({
    @Property(name = "sling.servlet.resourceTypes", value = "project/components/header"),
    @Property(name = "sling.servlet.selectors", value = "foo")
})
public class HeaderServlet extends SlingSafeMethodsServlet { ...

Now, as these properties were visible via Metatype as well, you could overwrite them with an OSGi configuration and register the servlet on a different selector just by configuration. Or you could read these properties from the ComponentContext. That was not a problem (and hopefully no one ever really used it …).

With the OSGi annotations this is no longer possible. You have configuration and component properties; you can change the configuration, but not the component properties anymore.
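To illustrate the configuration side, here is a minimal sketch using the standard OSGi R6 annotations. The class names, the attribute and its default value are made up for illustration; only the annotations themselves are the standard ones:

```java
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.metatype.annotations.AttributeDefinition;
import org.osgi.service.metatype.annotations.Designate;
import org.osgi.service.metatype.annotations.ObjectClassDefinition;

// The configuration now lives in its own annotation type: type-safe and named.
@ObjectClassDefinition(name = "Header Configuration")
@interface HeaderConfig {

    @AttributeDefinition(name = "Greeting", description = "Text rendered in the header")
    String greeting() default "Hello";
}

// The component references the OCD via @Designate and receives the
// configuration values in its activate method.
@Component(service = HeaderService.class)
@Designate(ocd = HeaderConfig.class)
class HeaderService {

    private String greeting;

    @Activate
    void activate(HeaderConfig config) {
        // a configuration item: can be changed via an OSGi configuration
        this.greeting = config.greeting();
    }
}
```

This only covers the configuration side; component properties go into the @Component annotation instead, as shown for the workflow step below.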

What does that mean for you?

Mostly: don't blindly change all properties into attributes of the @ObjectClassDefinition. For example, the label of a workflow step is not configuration, but rather a component property. That means you should use something like this:

@Component(property = {
    "process.label=My process label"
})
public class MyWorkflowProcess implements WorkflowProcess { ...

Disclaimer: I am not an OSGi expert, this is just my impression from dealing with that stuff a lot. Carsten, David, feel free to correct me 🙂


AEM technical conferences in 2018

In the last years I attended a few conferences in the AEM space and I never got disappointed. At all of them I attended brilliant talks and presentations, met a lot of people and in general enjoyed well organized conferences.

In 2018 I will again try to visit at least one conference, and for convenience I have collected here the AEM conferences I know of. In case I missed one, please leave a comment.
I ignore the local meetups here, because typically you won’t take a flight to attend such a meetup 🙂

In no particular order I found these conferences covering technical AEM topics in 2018:

  • Adobe Summit US (although it covers muuuuuuch more than just AEM) in Las Vegas, NV
  • Immerse 2018, a virtual conference
  • Evolve in San Diego, CA
  • adaptTo() in Potsdam, Germany; in 2018 at a larger venue than in the years before

At the moment I can already announce that I will participate in the Adobe Summit (again giving a lab session). I am not sure yet whether I will present at Immerse.

If you miss any conference with AEM related content, just speak up, I’ll add it to the list.

What’s new in Sling with AEM 6.3

AEM 6.3 is out. Besides the various product enhancements and new features it also includes updated versions of many open source libraries. So it's your chance to have a closer look at the version numbers and find out what has changed. And again, quite a lot.

For your convenience I collected the version information of the sling bundles which are part of the initial released version (GA). Please note that with Servicepacks and Cumulative Hotfixes new bundle versions might be introduced.
If a version number is marked red, it has changed compared to the previous AEM versions. Bundles which are not present in the given version are marked with a dash (-).

For the versions 5.6 to 6.1 please see this older posting of mine.

Symbolic name of the bundle AEM 6.1 (GA) AEM 6.2 (GA) AEM 6.3 (GA)
org.apache.sling.adapter 2.1.4 2.1.6 (Changelog) 2.1.8 (Changelog)
org.apache.sling.api 2.9.0 2.11.0 (Changelog) 2.16.2 (Changelog)
org.apache.sling.atom.taglib 0.9.0.R988585 0.9.0.R988585 0.9.0.R988585
org.apache.sling.auth.core 1.3.6 1.3.14 (Changelog) 1.3.24 (Changelog)
org.apache.sling.bgservlets 0.0.1.R1582230 1.0.6 (Changelog)
org.apache.sling.bundleresource.impl 2.2.0 2.2.0 2.2.0
org.apache.sling.caconfig.api 1.1.0
org.apache.sling.caconfig.impl 1.2.0
org.apache.sling.caconfig.spi 1.2.0
org.apache.sling.commons.classloader 1.3.2 1.3.2 1.3.8 (Changelog)
org.apache.sling.commons.compiler 2.2.0 2.2.0 2.3.0 (Changelog)
org.apache.sling.commons.contentdetection 1.0.2 1.0.2
org.apache.sling.commons.fsclassloader 1.0.0 1.0.2 (Changelog) 1.0.4 (Changelog)
org.apache.sling.commons.html 1.0.0 1.0.0 1.0.0
org.apache.sling.commons.json 2.0.10 2.0.16 (Changelog) 2.0.20 (Changelog)
org.apache.sling.commons.log 4.0.2 4.0.6 (Changelog) 5.0.0 (Changelog)
org.apache.sling.commons.log.webconsole 1.0.0
org.apache.sling.commons.logservice 1.0.4 1.0.6 (Changelog) 1.0.6
org.apache.sling.commons.metrics 1.0.0 (Changelog) 1.2.0 (Changelog)
org.apache.sling.commons.mime 2.1.8 2.1.8 2.1.10 (Changelog)
org.apache.sling.commons.osgi 2.2.2 2.4.0 (Changelog) 2.4.0
org.apache.sling.commons.scheduler 2.4.6 2.4.14 (Changelog) 2.5.2 (Changelog)
org.apache.sling.commons.threads 3.2.0 3.2.6 (Changelog) 3.2.6
org.apache.sling.datasource 1.0.0 1.0.0 1.0.2 (Changelog)
org.apache.sling.discovery.api 1.0.2 1.0.2 1.0.4 (Changelog)
org.apache.sling.discovery.base 1.1.2 (Changelog) 1.1.6 (Changelog)
org.apache.sling.discovery.commons 1.0.10 (Changelog) 1.0.18 (Changelog)
org.apache.sling.discovery.impl 1.1.0
org.apache.sling.discovery.oak 1.2.6 (Changelog) 1.2.16 (Changelog)
org.apache.sling.discovery.support 1.0.0 1.0.0 1.0.0
org.apache.sling.distribution.api 0.1.0 0.3.0 0.3.0
org.apache.sling.distribution.core 0.1.1.r1678168 0.1.15.r1733486 0.2.6
org.apache.sling.engine 2.4.2 2.4.6 (Changelog) 2.6.6 (Changelog)
org.apache.sling.event 3.5.5.R1667281 4.0.0 (Changelog) 4.2.0 (Changelog)
org.apache.sling.event.dea 1.0.0 1.0.4 (Changelog) 1.1.0 (Changelog)
org.apache.sling.extensions.threaddump 0.2.2
org.apache.sling.extensions.webconsolesecurityprovider 1.1.4 1.1.6 (Changelog) 1.2.0 (Changelog)
org.apache.sling.featureflags 1.0.0 1.0.2 (Changelog) 1.2.0 (Changelog)
org.apache.sling.fragment.ws 1.0.2 1.0.2 1.0.2
org.apache.sling.fragment.xml 1.0.2 1.0.2 1.0.2
org.apache.sling.hapi 1.0.0 1.0.0
org.apache.sling.hc.core 1.2.0 1.2.2 (Changelog) 1.2.4 (Changelog)
org.apache.sling.hc.webconsole 1.1.2 1.1.2 1.1.2
org.apache.sling.i18n 2.4.0 2.4.4 (Changelog) 2.5.6 (Changelog)
org.apache.sling.installer.console 1.0.0 1.0.0 1.0.2 (Changelog)
org.apache.sling.installer.core 3.6.4 3.6.8 (Changelog) 3.8.6 (Changelog)
org.apache.sling.installer.factory.configuration 1.1.2 1.1.2 1.1.2
org.apache.sling.installer.factory.subsystems 1.0.0
org.apache.sling.installer.provider.file 1.1.0 1.1.0 1.1.0
org.apache.sling.installer.provider.jcr 3.1.16 3.1.18 (Changelog) 3.1.22 (Changelog)
org.apache.sling.javax.activation 0.1.0 0.1.0 0.1.0
org.apache.sling.jcr.api 2.2.0 2.3.0 (Changelog) 2.4.0 (Changelog)
org.apache.sling.jcr.base 2.2.2 2.3.2 (Changelog) 3.0.0 (Changelog)
org.apache.sling.jcr.compiler 2.1.0
org.apache.sling.jcr.contentloader 2.1.10 2.1.10 2.1.10
org.apache.sling.jcr.davex 1.2.2 1.3.4 (Changelog) 1.3.8 (Changelog)
org.apache.sling.jcr.jcr-wrapper 2.0.0 2.0.0 2.0.0
org.apache.sling.jcr.registration 1.0.2 1.0.2 1.0.2
org.apache.sling.jcr.repoinit 1.1.2
org.apache.sling.jcr.resource 2.5.0 2.7.4.B001 (Changelog) 2.9.2 (Changelog)
org.apache.sling.jcr.resourcesecurity 1.0.2 1.0.2 1.0.2
org.apache.sling.jcr.webdav 2.2.2 2.3.4 (Changelog) 2.3.8 (Changelog)
org.apache.sling.jmx.provider 1.0.2 1.0.2 1.0.2
org.apache.sling.launchpad.installer 1.2.0 1.2.2 (Changelog) 1.2.2
org.apache.sling.models.api 1.1.0 1.2.2 (Changelog) 1.3.2 (Changelog)
org.apache.sling.models.impl 1.1.0 1.2.2 (Changelog) 1.3.9.r1784960 (Changelog)
org.apache.sling.models.jacksonexporter 1.0.6
org.apache.sling.provisioning.model 1.8.0
org.apache.sling.repoinit.parser 1.1.0
org.apache.sling.resource.inventory 1.0.4 1.0.4 1.0.6 (Changelog)
org.apache.sling.resourceaccesssecurity 1.0.0 1.0.0 1.0.0
org.apache.sling.resourcecollection 1.0.0 1.0.0 1.0.0
org.apache.sling.resourcemerger 1.2.9.R1675563-B002 1.3.0 (Changelog) 1.3.0
org.apache.sling.resourceresolver 1.2.4 1.4.8 (Changelog) 1.5.22 (Changelog)
org.apache.sling.rewriter 1.0.4 1.1.2 (Changelog) 1.2.1.R1777332 (Changelog)
org.apache.sling.scripting.api 2.1.6 2.1.8 (Changelog) 2.1.12 (Changelog)
org.apache.sling.scripting.core 2.0.28 2.0.36 (Changelog) 2.0.44 (Changelog)
org.apache.sling.scripting.java 2.0.12 2.0.14 (Changelog) 2.1.2 (Changelog)
org.apache.sling.scripting.javascript 2.0.16 2.0.28 2.0.30
org.apache.sling.scripting.jsp 2.1.6 2.1.8 2.2.6
org.apache.sling.scripting.jsp.taglib 2.2.4 2.2.4 2.2.6
org.apache.sling.scripting.jst 2.0.6 2.0.6 2.0.6
org.apache.sling.scripting.sightly 1.0.2 1.0.18 1.0.32
org.apache.sling.scripting.sightly.compiler 1.0.8
org.apache.sling.scripting.sightly.compiler.java 1.0.8
org.apache.sling.scripting.sightly.js.provider 1.0.4 1.0.10 1.0.18
org.apache.sling.scripting.sightly.models.provider 1.0.0 1.0.6
org.apache.sling.security 1.0.10 1.0.18 1.1.2
org.apache.sling.serviceusermapper 1.2.0 1.2.2 1.2.4
org.apache.sling.servlets.compat 1.0.0.Revision1200172 1.0.0.Revision1200172 1.0.0.Revision1200172
org.apache.sling.servlets.get 2.1.10 2.1.14 2.1.22
org.apache.sling.servlets.post 2.3.6 2.3.8 2.3.15.r1780096
org.apache.sling.servlets.resolver 2.3.6 2.4.2 2.4.10
org.apache.sling.settings 1.3.6 1.3.8 1.3.8
org.apache.sling.startupfilter 0.0.1.Rev1526908 0.0.1.Rev1526908 0.0.1.Rev1764482
org.apache.sling.startupfilter.disabler 0.0.1.Rev1387008 0.0.1.Rev1387008 0.0.1.Rev1758544
org.apache.sling.tenant 1.0.2 1.0.2 1.1.0
org.apache.sling.tracer 0.0.2 1.0.2

AEM 6.3: set admin password on initial startup (Update)

With AEM 6.3 you have the chance to set the admin password already on the initial start. By default the quickstart asks you for the password if you start it directly. That's a great feature and shortens quite some deployment instructions, but it doesn't always work.

For example, if you first unpack your AEM instance and then use the start script, you'll never get asked for the new admin password. The same happens if you work in an application server setup. And if you do automated installations, you don't want to be asked at all.

But I found that even in these scenarios you can set the admin password as part of the initial installation. There are 2 different ways:

  1. Set the system property “admin.password” to your desired password, and it will be used (for example, add “-Dadmin.password=mypassword” to the JVM parameters).
  2. Or set the system property “admin.password.file” and pass the path to a file as its value; if this file is readable by the AEM instance and contains the line “admin.password=myAdminPassword”, this value will be used as the admin password.
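Sketched as startup configuration (the quickstart jar name and the file path are assumptions; adjust them to your installation):

```
# Option 1: pass the password directly as a system property
java -Dadmin.password=mypassword -jar cq-quickstart-6.3.0.jar

# Option 2: reference a file which contains the line admin.password=...
echo "admin.password=myAdminPassword" > /path/to/adminpw.txt
java -Dadmin.password.file=/path/to/adminpw.txt -jar cq-quickstart-6.3.0.jar
```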

Please note that this only works on the initial startup. On all subsequent startups these system properties are ignored; you should probably remove them, or at least purge the file in case of option (2).

Update: Ruben Reusser mentioned that the OSGi web console admin password is not changed (it is used in case the repository is not running). So you still need to take care of that.

What I check on code reviews

On several occasions in the last months I did code reviews on AEM projects. I don't do that exercise very often, so I don't have a real routine or checklist of what to look at. But in the past I learned some lessons about how to write code for AEM, so I hope I check the relevant pieces. Feedback appreciated.

So let’s start with my top 10 items I look for:

  1. The use of “System.out.println()“, “System.err.println()“ and “e.printStackTrace()“ statements. Logging is cheap and easy, but obviously not easy enough; I still find these statements. They should be replaced, because they do not provide relevant metadata like timestamp and class name. And to be honest, people tend to look only at the error.log file, not at stdout.log.
  2. Servlets bound to fixed paths. In most cases they should be replaced by binding either to a selector or to a resource type. The Sling documentation explains it quite well.
  3. The creation of dedicated sessions/ResourceResolvers (either admin sessions or service user sessions) and whether these sessions are properly closed. Although this should be common knowledge among AEM developers, there's still code out there which doesn't close resource resolvers or log out sessions, causing memory leaks.
  4. The existence of long-running sessions. You shouldn't write services which open a session on activate and close it on deactivate (see this blog post for the explanation). The only exception to this rule: JCR observation handlers.
  5. adaptTo() calls without proper null checks. adaptTo() is allowed to return null. There are cases where the check can be neglected (in reality you'll never have all occurrences checked), but in most cases it is required to avoid NullPointerExceptions.
  6. Log hygiene part 1: the excessive use of LOG.error() when a LOG.info() is sufficient. Or even worse: LOG.error/info() instead of LOG.debug().
  7. Log hygiene part 2: log messages without a meaningful description. A log message has to contain relevant information: with what parameter did this exception happen? At which node? For which request? Consider that some parts are always logged implicitly (e.g. the thread name, which in case of a request contains the requested path), but you need to provide every other piece of information which can help to understand the problem when found in the logs.
  8. The mixed usage of the JCR and Sling APIs. Choose either one, but then stick to it. You should not have a method which takes both a Session and a ResourceResolver as parameters (or other objects from these domains).
  9. Performing string operations on paths. I already blogged about it.
  10. JCR queries. Are they properly used, or can they be replaced by a short tree traversal?
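For item 5, a hedged sketch of the null check (Resource and ValueMap are from the Sling API; the class and the property name are made up):

```java
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ValueMap;

public class TitleReader {

    public String readTitle(Resource resource) {
        // adaptTo() is allowed to return null, so guard the result
        ValueMap properties = resource.adaptTo(ValueMap.class);
        if (properties == null) {
            // adaptation failed; fall back instead of risking a NullPointerException
            return "";
        }
        return properties.get("jcr:title", "");
    }
}
```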

So when you get through all of these quite generic items, your code is already in quite good shape. And if you don't have a specific problem which I should investigate, I will likely stop here. Because then you have already proved that you understand AEM and how to operate it quite well, so I wouldn't expect major issues anymore.

I am sure that one's personal background influences the intuitive approach to code reviews a lot. Therefore I am interested in your checklists and how they differ from mine. Just leave a comment, drop me a mail or tweet me 🙂

Let’s try to compile a list which we all can use to improve our code.

Application changes and incompatible features

In the lifecycle of every larger application there are many occasions where features evolve. In most cases the new version of a feature is compatible with earlier versions, but there are always cases where incompatible changes are required.

As an example let me take the scenario the guys at KD WebConsult provided in their latest blog entry (which, I have to admit, inspired me to write this post): there is a component which introduces changes in the underlying repository structure, and the question is how to cope with that change in case of deployments.

I think that is a classical case of an incompatible change, which always results in additional effort; that's the reason why no one likes incompatible changes and everyone tries to avoid them as much as possible. While in a pure AEM environment you should have full control over all changes, it gets harder if you have systems you depend on, or systems depending on you. Then you run into the topic of interface lifecycle management. Making changes then gets hard and sometimes nearly impossible. You end up supporting multiple versions of an interface at the same time; duplicating code is then one way to cope with it. (The technical debt is then not only on your side, but also on the side of others not updating, or not able to update, their use of the interfaces.)

So to come back to the KD WebConsult example: I think the cleanest solution is to build your component in a way that it supports both the old and the new repository structure (their option 2). And when you are sure that you don't use the old structure anymore, you can safely remove the logic for it.
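Such a dual-structure component could be sketched like this (Sling API assumed; the child name "details" and the property "title" are invented placeholders for the new and old structures):

```java
import org.apache.sling.api.resource.Resource;

public class TitleProvider {

    public String getTitle(Resource component) {
        // Prefer the new structure: the value lives on a dedicated child resource
        Resource details = component.getChild("details");
        if (details != null) {
            return details.getValueMap().get("title", "");
        }
        // Fall back to the old structure: the property sits directly on the component
        return component.getValueMap().get("title", "");
    }
}
```

Once all content has been migrated, the fallback branch can be deleted without touching the deployment process.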

The thinking that you can always avoid such situations and keep your code clean is wrong. As soon as you are dealing with a non-trivial setup (and AEM in an enterprise setup is per se a distributed application which comes along with other enterprisey requirements like high availability) you have to make compromises. And taking on technical debt for the time of a release or two is not necessarily a bad thing if you can stick with standard processes (not changing deployment processes).

Automation done right — a reliable process

(The subtitle could be: why I don't call simple curl scripts “automation”.)

Jakub Wadalowski mentioned in his talk at CircuitConf 2016 that “Jörg doesn't like curl”. Which is not totally true: I love curl as a command line HTTP client (if only to avoid netcat).

But I don’t like the way how people are using curl in scripts and then call these scripts „automation“; I already blogged about this topic.

Good AEM automation requires much more than just package installation. True automation has to work in a way that you don't need to perform any manual intervention, supervision or control. It has to work in a reliable and repeatable fashion.
But we also know that there will always be errors and problems which prevent a successful result. A package installation can work, but the bundles might not come up. In such cases the system has to actively report problems back. And this reporting has to be accurate, in a way that you can rely on it: if no message arrives, everything is fine. There are no manual checks anymore!
This really requires a process you can trust. You need to trust your automated process so much that you can start it and then go home.

So to come back to Jakub's “Jörg doesn't like curl” statement: when you just have the curl calls without error checking (curl can report the HTTP status code via its --write-out option, but hardly any script evaluates it) and without proper error handling and reporting, it will never be reliable. It might save you a few clicks, but in the end you still have to perform a lot of manual steps. And getting away from these manual steps with shell scripting alone requires a good amount of scripting.

And to be honest: I've rarely seen “good shell scripts”.

Storage sizing and SAN

A part of every project I’ve done in the last years was always the task to create a hardware sizing; many times it was part of the project setup and was a very important piece which got fed into the hardware provisioning process.

(Hardware sizing is often required during presales stages, where it is mostly used to compare different vendors in terms of investment into hardware. In such situations the process to create a sizing is very similar, but the results are often communicated quite differently …)

Depending on the organization this hardware provisioning process can be quite lengthy, and after the parameters have been set they are very hard to change. That is especially true for large organizations which want to use an on-premise deployment in their own datacenter, because then it means starting a purchase process, followed by provisioning, installation, and so on. Especially in the FSI area it is not uncommon to have 6 months between the signed order of the budget manager and the handover of an installed server. And then you get exactly what you asked for. So everyone tries to make sure that the input data is as good as possible and that all variables have been considered. It reminds me a lot of the waterfall model in the software development business, and it comes with the same problems.

Even if your initial sizing was 100% accurate (which it never is), 6 months are a long time in which important project parameters can change. So there is a not-so-small chance that at the time the servers are handed over to you, you already know that the hardware is not sufficient anymore, based on the information you have right now.
But changes are hardly possible, because for cost efficiency you did not order the hardware which offered the most flexibility for future growth, but a model with some constraints in terms of extensibility. And now, 6 months after project start and just 4 months ahead of the go-live date, you cannot restart the purchasing process anymore!

Or a bit worse: you have already been live for 6 months and now you start to run short of disk space, because your growth is much higher than anticipated. But the drive bays of your servers are already full and you have already installed the largest disks available.

For the topic of disk space the solution can be quite easy: don't use local disks! Even if local SSDs are very performant in terms of IOPS, try to avoid the discussion and go for a SAN (storage area network), which should already be available in an enterprise datacenter. (Of course you can also choose any other technology which strongly decouples the storage from the server and performs well.) For AEM and TarMK a good SAN is sufficient to deliver decent performance (even if a local SSD improves it again).

I know that this statement can be problematic, as there are cases where IOPS are more important than the flexibility to easily add storage. Then your only choice is to take the SSDs and make sure that you can still add more of them.

The benefit of a SAN is that you separate the servers from their storage, and you can upgrade or extend them independently of each other. Adding more hard drives to a SAN is always possible, so you hardly have a limit in terms of available disk space per server. Attaching more disk space to a server is then a matter of minutes and can be done incrementally. This also allows you to attach disk space on demand instead of allocating the full amount at provisioning time (and consuming it only 2 years later).
And if you have servers and storage split up, it is also much easier to replace a server with a model with more capacity (RAM- or CPU-wise), because you don't need to move all the data, but rather just connect the storage to a different server.

So using a SAN does not free you from delivering a good sizing, but it can soften the impact of an insufficient sizing (mostly based on insufficient data), which is often the case at project kickoff.

Managing repository growth

On almost every project there is this moment when (all of a sudden) the disk space assigned to the AEM instances becomes full; in most (!) cases such a situation is detected early enough, so there is time to react, which often just means adding more disk capacity.

But after that the questions arise: Why is our repository so large? Why does it consume more than the initially estimated disk space? What content exactly is causing this problem? What went wrong so we actually got into that situation?

From my point of view there are 3 different views on this situation, each of which can be part of the answer:

  • Disk space is not managed well; unnecessary stuff is consuming a lot of space outside of the repository.
  • The maintenance jobs are not executed.
  • The estimation for content and content growth was not realistic.

And in most cases you need to check all these 3 views to answer the question “Why are we running out of space?”

Disk space is not managed well
This is an operations problem in the first place, because non-repository data is consuming disk space which has been planned for the repository. Often seen:

  • Heap dumps and log files are not rotated or removed.
  • Manually created backup files have been copied once to a folder next to the original one, but in the meantime they are not useful anymore, because they are totally out of date.
  • The regular backup process creates temporary files which are not cleaned up; or the backup process itself consumes temporary disk space, which makes the disk consumption spike.

So this can only be handled by careful operations and in-time purging of old data.

The maintenance jobs are not executed
Maintenance jobs are an essential part of the ongoing work to remove unnecessary fat from your AEM instance, be it on the content level or on the repository level. They include:

  • workflow purge
  • audit log purge
  • repository compaction (if you use TarMK)
  • datastore GC (if you use a datastore)

You should always keep an eye on these; the Maintenance Dashboard is a great help here. But do not rely on it blindly!

Your estimation for content and content growth was not realistic
That's a common problem; you have to deliver an initial hardware sizing, which also includes the amount of disk space used by the repository. You do your best to include all relevant parameters, and you add some buffer on top. But it is an estimation made at the beginning of the project, when you don't know all the requirements and their impact on disk consumption in detail. Still, that's what you said, and changing it afterwards is always problematic.

Or AEM is used differently than initially anticipated, and all the assumptions you based your initial hardware sizing on are no longer true. Or you just forgot to add the versioning of the assets to your calculation. Or …
There are a lot of cases where in retrospect the initial sizing of the required disk space was just incorrect. In that case you have only one choice: redo the calculation right now! Take your new insights and create a new hardware sizing. And then implement it and add more hardware.

And the only way I see to avoid such situations is: do not make estimations for the next 3 years! Be agile and review your current numbers every 3 months; during this review you can also determine the needs for the next months and plan accordingly. Of course this assumes that you are flexible in terms of disk sizing, so for any non-trivial setup the use of a SAN as storage technology (plus a volume manager) is my preferred choice!

Of course this does not free you from cleaning up the systems and running all required maintenance jobs; but it will make the review of the used disk space a regular task, so you should see deviations from your anticipated plan much earlier.

What is writing to my Oak repository?

If you were ever curious what's happening on the Oak repository in AEM (and I can tell you: a lot!), there's a way to log all the repository write operations.

Just create a logger for “org.apache.jackrabbit.oak.jcr.operations.writes” on log level TRACE and there you go.
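In a Sling/AEM setup this can be done with a dedicated logging configuration; sketched as an OSGi factory configuration for the Sling Commons Log bundle (the configuration name and the target log file are assumptions):

```
# Factory configuration for PID org.apache.sling.commons.log.LogManager.factory.config
org.apache.sling.commons.log.level="trace"
org.apache.sling.commons.log.names=["org.apache.jackrabbit.oak.jcr.operations.writes"]
org.apache.sling.commons.log.file="logs/oak-writes.log"
```

Writing the output to a separate log file keeps the TRACE noise out of the regular error.log.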


21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:userid]
21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:path]
21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:type]
21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:category]
21.05.2016 23:38:34.353 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/userId]
21.05.2016 23:38:34.354 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/path]
21.05.2016 23:38:34.354 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/type]
21.05.2016 23:38:34.354 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] Setting property [/var/audit/com.day.cq.wcm.core.page/content/geometrixx/en/services/59a34cc6-ee23-4423-9e76-03868cd5e7e6/cq:properties]
21.05.2016 23:38:34.356 *TRACE* [Thread-8145] org.apache.jackrabbit.oak.jcr.operations.writes [session-43731] save

The thread with the name [Thread-8145] writes to the /var/audit path, so it's quite likely related to the audit logger. And it is using the session [session-43731]; the session name is random per session, but it is a very useful piece of information:

  • You can identify a single session and what's happening within it. This is especially useful to determine what a specific session is writing and how often it saves.
  • If you have the session name of a long-running session, you can look it up in the JMX MBean console; in Oak 1.0 and Oak 1.2 a stack trace is stored when the session is opened. In Oak 1.4 this has been removed for performance reasons, but you can get it back by setting the system property “oak.sessionStats.initStackTraceThreshold” to “0” (zero).

So if you need to understand what's happening on your repository and what might be causing repository write activity, this is an easy way to go. The only drawback: such logging eats up a lot of disk space, especially if you run it for an extended period of time.

It's also possible to do this for reads, but at least on Oak 1.0 it doesn't log the paths, only the operations; so it's less useful there. And it produces a lot of data: having it enabled for 2 minutes and a single page load on my local instance produced 10 megabytes of logs.