One recurring question I see in the Adobe internal communication channels is like this: “For our customer X we need to know how long Adobe stores backups for our CS instances”.
The obvious answer to this is “7 days” (see the documentation) or “3 months” (for Offsite backup), because the backup is designed only to handle cases of data corruption of the repository. But in most cases there is a follow-up question: “But we need access to backup data for up to 5 years”. Then it’s clear that this question is not about backup, but rather about content archival and compliance. And that’s a totally different question.
TL;DR
When you need to retain content for compliance reasons, my colleagues are happy to discuss the details with you. But increasing the retention period for your backups is not a solution for it.
Compliance
So what does “content archival and compliance” mean in this situation? For regulatory and legal reasons some industries are required to retain all public statements (including websites) for some time (normally 5-10 years). And of course the implementation of that is up to the company itself. And it seems quite easy to implement an approach which keeps the backups around for up to these 10 years.
Some years back I spent some time at the drawing board to design a solution for an AEM on-prem customer; their requirement was to be able to prove what was displayed to customers on their website at any time within these 10 years.
We initially also thought about keeping backups around for 10 years; but then we came up with these questions:
- When the content is required, that backup would have to be restored to an environment which can host this AEM instance. Is such an environment (servers, virtual machines) available? How many of these environments would be required, assuming that this instance would need to run for some months (throughout the entire legal process which requires content from that time)?
- Assuming that an 8-year-old backup must be restored, are the old virtual machine images with Red Hat Enterprise Linux 7 (or whatever OS) still around? Is it okay from a compliance perspective to run these old and potentially unsupported OS versions, even in a secured network environment? Is the documentation which describes how to install all of that still around? Does your backup system still support a restore to such an old OS version?
- How would you authenticate against such an old AEM version? Would you require your users to have their old passwords at hand (if you authenticate against AEM), or does your central identity management still support the interface this old AEM version tries to use for authentication?
- As this is a web page, is it ensured that all external references which are embedded into the page are still available? Think about the JavaScript and CSS libraries, which are often just pulled from their respective CDN servers.
- How frequently must a backup be stored? Is it okay and possible to store just the authoring instance every quarter, skip all cleanup (version cleanup, workflow purge, …) in that time and have all content changes versioned, so you can use the restore functionality to go back to the requested time? Or do you need to store a backup after each deployment, because each deployment can change the UI and introduce backwards-incompatible changes which break the restored content? And would you need to archive the publish instance as well (where normally no versions are preserved)? And are you sure that you can trust the AEM version storage enough to rely on JCR versioning to recreate any intermediary states between those retained backups?
- When you design such a complex process, you should definitely test the restore process regularly.
- And finally: What are the costs of such a backup approach? Can you use the normal backup storage, or do you need a special solution which guarantees that the stored data cannot be tampered with?
You can see that the list of questions is long. I am not saying it is impossible, but it requires a lot of work and attention to detail.
In my project the deal breaker was the calculated storage cost (we would have required dedicated storage, as the normal backup storage did not provide the required guarantees for archival purposes). So we decided to take a different approach, and we added a custom process which creates a PDF/A out of every activated page and stores it in the dedicated archival solution (assets are stored as is). This adds upfront costs (a custom implementation), but is much cheaper in the long run. And on top of that, it does not need IT to access the old version of the homepage of January 23, 2019; instead the business users or legal can directly access the archive and fetch the respective PDF for the time they are interested in.
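To illustrate the lookup such an archive has to answer ("what was displayed for this page on a given day?"), here is a minimal sketch in Python. It is not the implementation from that project: the `ArchiveIndex` class, its method names, the page path and the archive keys are all hypothetical, and the in-memory dictionary only stands in for whatever index the dedicated archival solution provides. The assumption is that every activation records a timestamp together with a reference to the stored PDF/A.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class ArchiveIndex:
    """Hypothetical in-memory stand-in for the index of an archival solution."""

    def __init__(self):
        # page path -> list of (activation time, archive key), sorted by time
        self._entries = {}

    def record_activation(self, page_path, activated_at, archive_key):
        """Remember that a PDF/A rendition was archived for this activation."""
        entries = self._entries.setdefault(page_path, [])
        entries.append((activated_at, archive_key))
        entries.sort()

    def displayed_at(self, page_path, moment):
        """Return the archive key of the latest activation at or before `moment`."""
        entries = self._entries.get(page_path, [])
        times = [activated_at for activated_at, _ in entries]
        pos = bisect_right(times, moment)
        if pos == 0:
            return None  # the page was not activated before that moment
        return entries[pos - 1][1]


# Hypothetical example data: two activations of the same page.
index = ArchiveIndex()
home = "/content/site/en/home"
index.record_activation(home, datetime(2019, 1, 10, tzinfo=timezone.utc), "home-2019-01-10.pdf")
index.record_activation(home, datetime(2019, 2, 1, tzinfo=timezone.utc), "home-2019-02-01.pdf")

# Which rendition was live on January 23, 2019?
print(index.displayed_at(home, datetime(2019, 1, 23, tzinfo=timezone.utc)))
# prints "home-2019-01-10.pdf"
```

The sketch ignores deactivations; a real process would have to record those as well, otherwise a page that was taken offline would still appear as displayed.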
In AEM CS the situation is a bit different, because the majority of the questions above deal with “old AEM vs everything else around is current”, and many aspects are not relevant for customers anymore; they are in the domain of Adobe instead. But I am not aware that Adobe ever planned to set up such a time machine, which would allow re-creating everything at a specific point in time (leaving aside all the security implications etc.), mostly because “everything” is a lot.
So, as a conclusion: Using backups for content archival and compliance is not the best solution. It sounds easy at first, but it raises a lot of questions if you look into the details. The longer you need to retain these AEM backups, the more likely it is that inevitable changes in the surrounding environments make it harder or even impossible to get them working properly.