Do not write to the repository in event handling

Repository writes are resource-intensive operations that always come with a cost. First, a write operation holds a number of locks, which limits the overall concurrency of write operations. Second, the write itself can take some time, especially if the I/O is loaded or you are running in a cluster with MongoDB as backend; in this case it’s the latency of the network connection plus the latency of MongoDB itself. And third, every write operation causes async index updates, triggers JCR observation and Sling resource events, etc. That’s why you shouldn’t take write operations lightly.

But don’t be scared to write to the repo for performance reasons, no way! Instead, try to avoid unnecessary writes. Either batch them and collapse multiple write operations into a single transaction, if the business case allows it, or avoid the repository writes altogether, especially if the information is not required to be persisted.

Recently I came across a very stressful pattern of write operations: write amplification. A Sling Resource Event listener was listening for certain write events to the repository, and whenever one of these was received (which happened quite often), a Sling Job was created to handle the event. The job implementation then just made another small change to the repository (write!) and finished.

In that case a single write operation resulted in:

  • A write operation to persist the Sling Job
  • A write operation performed by the Job implementation
  • A write operation to remove the Sling Job

Each of these “regular” write operations caused 3 subsequent writes to the repository, which is a great way to kill your write performance completely. Luckily none of these 3 additional write operations caused the event listener to create a new Sling Job again … That would have caused the same effect as “out of office” notifications in the early days of Microsoft Exchange (which didn’t detect these and would send an “out-of-office” reply to the “out-of-office” sender): a very effective way of DOSing yourself!

a flood of writes and the remains of performance

But even without such a loop, the result was a heavily loaded environment with noticeably reduced perceived performance; thread dumps indicated massive lock contention on write operations. When these 3 additional writes were optimized (effectively removed, as it was possible to collect the information in memory and batch-write it after 30 seconds), the situation improved a lot, and the lock contention was gone.

The lesson you should take away from this scenario: Avoid writing to the repository in JCR Observation or Sling event listeners! Collect the data and write outside of these handlers in a separate thread.
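
To make the "collect and write later" idea a bit more concrete, here is a minimal sketch in plain Java. It is not the code from the project described above; the class name BufferedChangeCollector, the record() and flush() methods and the batchWriter callback are all made up for illustration. The event handler would only call record(), which touches no repository at all, while a background thread hands the collected paths to the writer once per interval.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: collects changed paths in memory and writes them in batches. */
final class BufferedChangeCollector implements AutoCloseable {

    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    // Assumption: the callback performs the actual repository write for a whole batch.
    private final Consumer<List<String>> batchWriter;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    BufferedChangeCollector(Consumer<List<String>> batchWriter, long interval, TimeUnit unit) {
        this.batchWriter = batchWriter;
        // The write happens on the scheduler thread, never on the event thread.
        scheduler.scheduleWithFixedDelay(this::flushQuietly, interval, interval, unit);
    }

    /** Called from the event handler: only an in-memory enqueue, no I/O. */
    void record(String path) {
        pending.add(path);
    }

    /** Drains everything collected so far and passes it to the writer as one batch. */
    synchronized void flush() {
        List<String> batch = new ArrayList<>();
        String path;
        while ((path = pending.poll()) != null) {
            batch.add(path);
        }
        if (!batch.isEmpty()) {
            batchWriter.accept(batch);
        }
    }

    private void flushQuietly() {
        try {
            flush();
        } catch (RuntimeException e) {
            // An uncaught exception would cancel all future scheduled runs.
            // In this sketch the failed batch is simply dropped; real code should log or retry.
        }
    }

    /** Stops the background flushing and writes whatever is still pending. */
    @Override
    public void close() {
        scheduler.shutdown();
        flush();
    }
}
```

In an AEM setup the batchWriter would typically open a resource resolver, apply all changes of the batch and commit once, and the interval would be something like the 30 seconds mentioned above. Keep in mind that data collected this way is lost if the instance goes down before the next flush, so this only fits information that can tolerate that.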

PS: An interesting side effect of Sling event listeners taking a long time is that these handlers are blacklisted once they take more than 5 seconds to process an event (e.g. because they were blocked on writing). They are then not fired again (until you restart AEM) unless you explicitly whitelist them or turn off this feature completely.