An old rule of thumb, even on earlier versions of CQ5, is “when you do large repository operations, do a session.save() every 1000 nodes”. The justification for this typically, that this is the default of the Package Manager, and therefor it’s a kind of recommended approach. And to be honest, I don’t know the real reason for it, even though I work in the Day/Adobe ecosystem for quite some time.
But nevertheless, with Oak the situation has changed a bit. Limits are much more explicit, and this rule of “every 1000 nodes do a save” can be considered still as true statement. But let me give you some background on it, why this exists at all. And then let’s find out, if this rule is still safe to use.
In the JCR specification there is the concept of transient space. This transient space holds all activities performed on a session until an implicit or explicit save() of the session. So the transient space holds all temporary data of a transaction, and the save() is comparable to the final commit of a transaction.
This transient space is typically hold inside the java heap, so dealing with it is fast.
But by definition this transaction is not bound in terms of size. So technically you should be able to create sessions, which modify all nodes and every property of a repository of 2 TB size. This transient space does not fit into heap of a standard size (say: 12GB) any more. In order to support this behavior nevertheless, Oak starts to move this transient space entirely into the storage (TarMK, Mongo) if the transient space is getting too large (in the DocumentNodeStore language this is called a “persistent branch”, see the documentation of the DocumentNodeStore for some details on branches); then the size of the transaction is only limited by the amount of free storage on the persistance, but no longer by the size of the Java heap.
The limit is called update.limit and by default this 10k (up to and including Oak 1.4/AEM 6.2, 100k starting with Oak 1.6/AEM 6.3, see OAK-3036. But of course you can change this value using “-Doak.update.limit=40000”.
This value describes the amount of changes a transient space for a single session can hold before it is moved into the persistence. A change is a any change to a property or a node (adding/removing/modifying/reordering/…).
OK, that’s the theory, but what does this mean for you?
First, if the transient space is swapped to the persistence, the final session.save() will take much longer compared to a transient space in memory. Because to do the save, the transient space needs to be read from the persistence first (which typically includes at least disk I/O, in cases of MongoDB network I/O, which is even slower).
And second, when you add or change nodes, you typical deal with properties as well. So if you are on AEM 6.2 or older, you should check that you don’t do too much changes within a session, so you don’t hit this “10’000 changes” limit and get the performance penalty. If you have a reasonable content structure, the above mentioned rule of thumb of “do a save every 1000 nodes” goes into the very right direction.
But that’s often not good enough, because the updates of the synchronous Oak indexes count towards the 10’000 changes as well. You might know, that the synchronous indexes mirror the JCR tree, thus adding 1000 JCR nodes will also add 1000 oak nodes for the nodetype index. And that’s not the only synchronous index…
Thus increasing the update.limit to a higher number makes pretty much sense just to be on the safe side. But there is a drawback when you have such large limits: It’s the size of the transient space. Imagine you upload 1000 assets (1 MB each) into your repository in a single session, and you have the update.limit set to 100’000. The number of changes will not reach the update.limit, that’s unlikey. But your transient space will consume 1 GB of heap at least! Is your system designed and setup to handle this? Do you have enough free JVM heap?
Let’s conclude: The rule of thumb “do a save every 1000 nodes” might be a bit too optimistic on AEM 6.2 and older (with default values), but ok on AEM 6.3. But always keep the amount of transient space in mind. It can overflow your heap and debugging out-of-memory situations is not nice.
If you are interested in the inner working of Oak, look at this great piece of documentation. It covers a lot of lowlevel concepts, which are useful to know when you deal with the repository more often.
Jorg, thanks for the good summary on this topic.