Disk space has become really cheap; my favorite dealer currently sells 2-terabyte disks for less than 100 Euro. Of course this price does not cover redundancy or backup, and of course enterprises don’t buy consumer hardware… Still, even for them 1 terabyte of disk space should be affordable. Well, sometimes you don’t get that impression: you have to fight for funding each and every gigabyte. So, sometimes disk consumption is a real problem…
So, let’s discuss some causes of disk consumption with CRX and the TarPM. The TarPM consists of four parts:
- the tar files
- the journal
- the index
- the datastore
The tar files
The TarPM is an append-only storage, and therefore every action done on the repository (including removals) consumes additional disk space; see the documentation for details. The space is only reclaimed when the TarOptimizer rewrites the tar files and drops the data which is no longer referenced.
The journal
The journal records the repository changes so they can be replayed by other cluster nodes. In the default configuration it consumes 100 megabytes for a single journal file. So this shouldn’t be a problem in any case (even if you don’t do clustering, you can neglect the additional disk space consumption).
The index
The index is Apache Lucene based and has an add-and-merge strategy: new index segments are added as dedicated files, and these segments are merged into the main index during times of low activity. When content is removed (and therefore shouldn’t be available in the index anymore), the respective parts of the index are deleted. So the index shouldn’t give you any problems. In case the index grows too large, you can tweak the indexing.xml, but for normal use cases that shouldn’t be necessary.
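As background on what “add-and-merge” means at the Lucene level, here is a tiny standalone sketch. It uses a current Lucene version (not the one bundled with CRX) and an assumed index path, purely to illustrate the mechanism: documents are appended as new segments, deletes only mark entries, and the space is reclaimed when segments are merged. CRX drives all of this internally; you never call Lucene yourself.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AddAndMergeDemo {

    public static void main(String[] args) throws Exception {
        // Assumed example path; each commit below may create an additional segment on disk.
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-demo"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            for (int i = 0; i < 3; i++) {
                Document doc = new Document();
                doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
                writer.addDocument(doc);   // appended to the currently open segment
                writer.commit();           // flushed as a dedicated segment file
            }

            writer.deleteDocuments(new Term("id", "1")); // only marks the document as deleted
            writer.commit();

            writer.forceMerge(1);          // merging rewrites the segments and drops deleted entries
        }
    }
}
```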
The datastore
Like the TarPM, the datastore (see the docs) is an append-only storage in the first place. Large binaries are placed there and referenced from the TarPM; but there are no references from the datastore back to the TarPM. So when a node is removed in the TarPM, the reference is removed, but not the datastore object: it might still be referenced from other nodes, and at this point that is simply not known. We can therefore end up with objects in the datastore which are not referenced anymore. To clean these up, the Datastore Garbage Collection checks all references to the datastore and removes the objects which are not needed anymore.
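One way to trigger this garbage collection programmatically is through the Jackrabbit API that CRX builds on. The following is a minimal sketch, assuming a Jackrabbit 2.x style API (older releases use scan()/stopScan()/deleteUnused() instead of mark()/sweep()), a placeholder repository lookup and placeholder admin credentials — check it against your actual CRX version rather than treating it as a drop-in tool.

```java
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.jackrabbit.api.management.DataStoreGarbageCollector;
import org.apache.jackrabbit.core.SessionImpl;
import org.apache.jackrabbit.core.TransientRepository;

public class RunDatastoreGc {

    public static void main(String[] args) throws Exception {
        // Placeholder repository lookup; a real CRX instance is obtained differently
        // (e.g. via JNDI, RMI or from within an OSGi service).
        Repository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // Not part of plain JCR: the garbage collector is a Jackrabbit extension.
            DataStoreGarbageCollector gc =
                    ((SessionImpl) session).createDataStoreGarbageCollector();
            try {
                gc.mark();   // walk the repository and mark every datastore record still referenced
                gc.sweep();  // remove all records that were not marked
            } finally {
                gc.close();
            }
        } finally {
            session.logout();
        }
    }
}
```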
What you should be aware of
If you need to reclaim storage, consider this:
- Versioning: Just removing pages is often not sufficient to reclaim the space they consume, because their content is still kept as versions; you also have to get rid of those versions (a version purge sketch follows below).
- Large child node lists can affect the size of the TarPM itself quite heavily: every addition of a new child creates a new copy of the parent node (including its child node list, which grows continuously). So besides the performance impact of such large lists there is also an impact on disk consumption (see the bucketing sketch below).
- As a recommendation, run the TarOptimizer on a daily basis and the Datastore Garbage Collection on a monthly (or even quarterly) basis. Run the TarOptimizer first, then the Datastore Garbage Collection.
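To make the versioning point concrete, here is a minimal sketch using the standard JCR 2.0 versioning API: it purges all versions of a node except the root and base version (which cannot be removed), so that a later Datastore Garbage Collection can actually reclaim the binaries those versions referenced. The path handling and the “keep only the base version” policy are assumptions for the example; in practice you would typically keep a number of recent versions.

```java
import java.util.ArrayList;
import java.util.List;

import javax.jcr.Session;
import javax.jcr.version.Version;
import javax.jcr.version.VersionHistory;
import javax.jcr.version.VersionIterator;
import javax.jcr.version.VersionManager;

public class VersionPurge {

    /**
     * Removes all versions of the node at absPath except the root version and the
     * current base version, freeing their space for the next TarOptimizer /
     * Datastore Garbage Collection run.
     */
    public static void purgeVersions(Session session, String absPath) throws Exception {
        VersionManager vm = session.getWorkspace().getVersionManager();
        VersionHistory history = vm.getVersionHistory(absPath);
        String baseVersionName = vm.getBaseVersion(absPath).getName();

        // Collect the names first; removing versions while iterating the history is fragile.
        List<String> removable = new ArrayList<>();
        VersionIterator it = history.getAllVersions();
        while (it.hasNext()) {
            Version version = it.nextVersion();
            String name = version.getName();
            if (!"jcr:rootVersion".equals(name) && !baseVersionName.equals(name)) {
                removable.add(name);
            }
        }
        for (String name : removable) {
            history.removeVersion(name);
        }
    }
}
```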
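And to illustrate the child node list point, here is a small, purely hypothetical sketch in plain JCR: instead of attaching every new item to one flat parent, the children are spread over two levels of hash-derived buckets, so no single child node list (and therefore no repeatedly copied parent node) grows without bound. The bucketing scheme and the nt:unstructured node type are just assumptions for the example, not something CRX prescribes.

```java
import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class BucketedWriter {

    /**
     * Adds a child below parentPath, but inside two levels of small "bucket" nodes
     * derived from a hash of the name instead of directly in one huge flat list.
     */
    public static Node addBucketed(Session session, String parentPath, String name)
            throws RepositoryException {
        Node parent = session.getNode(parentPath);

        // Hypothetical bucketing scheme: take 2 + 2 hex digits of the name's hash code.
        String hash = String.format("%08x", name.hashCode());
        Node level1 = getOrAddNode(parent, hash.substring(0, 2));
        Node level2 = getOrAddNode(level1, hash.substring(2, 4));

        Node child = level2.addNode(name, "nt:unstructured");
        session.save();
        return child;
    }

    private static Node getOrAddNode(Node parent, String name) throws RepositoryException {
        return parent.hasNode(name) ? parent.getNode(name) : parent.addNode(name, "nt:unstructured");
    }
}
```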
Great post. A few questions/comments… Why should Tar Opt be run before GC? Also, if there is heavy content authoring, such as when content is first being loaded, I would recommend running (or cron-jobbing) the GC once a week.
The smaller the TarPM, the faster the DatastoreGC 🙂