The web, an eventually consistent system

For many large websites, CDNs are the foundation for delivering content quickly to customers around the world. The ability of CDNs to cache responses close to consumers also allows these sites to operate on a much smaller hardware footprint than they would need if they delivered all content through their own systems. However, this comes at a cost: your CDN may now deliver content that is out of sync with your origin, because a change you make on your own system does not propagate to the CDN in an atomic fashion (“atomic” in the same sense as in the ACID properties of database implementations).
This is a conscious design decision, and it follows primarily from the CAP theorem. It states that in a distributed data storage system, you can only achieve 2 of these 3 guarantees:

  • Consistency
  • Availability
  • Partition tolerance

And in the case of a CDN (which is a highly distributed data storage system), its developers usually opt for availability and partition tolerance over consistency. That is, they accept delivering content that is out of date because the originating system has already updated it.

To mitigate this situation, the HTTP protocol has built-in features which address the problem at least partially. Check out the latest RFC draft on HTTP caching; it is a really good read. The main concept is the “TTL” (time-to-live): the CDN delivers a cached version of the content only for a configured time, and afterwards it fetches a fresh version from the origin system. The technical term for this is “eventually consistent”, because at that point the state of the system with respect to that content is consistent again.
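
To make this concrete, here is a minimal sketch of an origin handing out such a caching instruction via the Cache-Control header. It uses the plain Servlet API rather than anything AEM-specific, and the 300 seconds are just an assumed example value, not a recommendation:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal sketch: the origin tells browsers and shared caches (such as a CDN)
// how long this response may be reused before it has to be fetched again.
public class CachedContentServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // max-age is honored by browsers, s-maxage by shared caches like a CDN.
        // 300 seconds (5 minutes) is an assumed example value, not a recommendation.
        resp.setHeader("Cache-Control", "public, max-age=300, s-maxage=300");
        resp.setContentType("text/html;charset=UTF-8");
        resp.getWriter().write("<html><body>Hello, eventually consistent web!</body></html>");
    }
}
```

During those 300 seconds the CDN answers from its cache; only afterwards does it go back to the origin. That window is exactly the eventual-consistency delay described above.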

This is the approach all CDNs support, and it works very reliably. But only if you accept that when you change content on the origin system, it reaches your consumers only after this delay. The delay is usually set to a period of time that is empirically determined by the website operators, trying to balance the need to deliver fresh content (which requires a very low or no TTL) with the number of requests that the CDN can answer instead of the origin system (for which the TTL should be as high as possible). Usually it is in the range of a few minutes.

(Even if you don’t use a CDN in front of your origin systems, you need these caching instructions; otherwise browsers will make their own assumptions and cache the requested files as they see fit. Browsing the web without caching is slow, even on very fast connections. Not to mention what happens when using a mobile device over a slow 3G line … Eventual consistency is an issue you can’t avoid when working on the web.)

Caching is an issue you will always have to deal with when creating web presences. Try to cache as much as possible without neglecting the need to refresh or update content at any time.

You need to constantly address eventual consistency. Atomic changes (meaning changes that are immediately visible to all consumers) are possible, but they come at a price: you can’t use CDNs for this content; you must deliver it all directly from your origin system. In that case, you also need to design your origin system so that it works without eventual consistency at all (which is built into many systems). Not to mention the additional load it will have to handle.

And for this reason I would always recommend not relying on atomic updates or consistency across your web presence. Always factor eventual consistency into the delivery of your content. In most cases, even business requirements for “immediate updates” can be met with a TTL of 1 minute. Still not “immediate”, but good enough in 99% of all cases. For the remaining 1% where consistency is mandatory (e.g. real-time stock trading) you need to find a different solution. And I am not sure if the web is always the right technology then.

And as an afterthought regarding TTL: of course many CDNs offer you the chance to actively invalidate content, but it often comes at a price. In many cases you can invalidate only single files. Often it is not an immediate action, but takes anywhere from seconds to many minutes. And the price is always that you have to have the capacity to handle the load when the CDN needs to refetch a larger chunk of content from your origin system.
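
For illustration, here is a rough sketch of what such an active invalidation call typically looks like. The endpoint, the authentication header and the payload are purely hypothetical; every CDN exposes its own purge API, so consult your provider’s documentation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CdnPurge {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical purge endpoint, credentials and payload; every CDN has its own API.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.cdn.example.com/purge"))
                .header("Authorization", "Bearer <api-token>")
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"path\":\"/content/site/en/home.html\"}"))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Many CDNs process purges asynchronously: an accepted request does not mean
        // the content is already gone from every edge node.
        System.out.println("Purge request answered with status " + response.statusCode());
    }
}
```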

The problems of multi-tenancy: the development model

In large enterprises, an AEM project tends to attract many different interested parties, all of which love to make use of the features of AEM. They want to get onto the platform as fast as they can. And this can be a real problem when it comes to such a multi-tenant AEM platform.

In the previous post I wrote about the governance problems with such projects and all the politics involved in them. These problems also persist in the daily business of developing and operating such platforms.

Many of these tenants already have development partners and agencies they are used to working with. These partners have experience in that specific area and know the business, so it’s quite likely that the tenants will continue to work with them on this project as well. And that’s where the technical problems start.

Because at that point you’ll realize that you have multiple teams which rarely collaborate, or in the worst case not at all. Teams which might have different skill levels, operate with different development models and use different tooling. Each of these teams has its own priorities and its own schedule, and in most cases the amount of communication between them is quite low.

So now the platform owner (or the development manager on their behalf) needs to set up a development model which allows these multiple teams to feed all their results into a single platform. A model which doesn’t slow down development agility and does not negatively impact the platform’s stability and performance. And this is quite hard.

Some of these challenges are (note: most of them are not specific to AEM at all!):

  • How do you ensure communication and collaboration between all development parties? That’s often a part which is left out (or forgotten) during time and budget estimation, and the amount of time actually spent on it reflects this fact. But it’s the most important piece here.
  • On the other hand, how do you make sure that the overhead of communication and coordination is as low as possible? In most cases this means that each party gets its own version control system, its own Maven modules and its own build jobs. This allows a better separation of concerns during development and build time, but it just postpones the problem. Because …
  • How do you avoid multiple parties using the same names where these names have to be unique? For example the same path below /apps or the same client library name? It’s hard to detect this at development time when you don’t have checks which cover multiple repositories and Maven modules (see the sketch after this list).
  • Somewhat related: how do you handle dependencies on the same library but in different versions? Although OSGi supports this at runtime, AEM isn’t really prepared for a situation where a library is deployed in both version 1 and version 2. So you need to centrally manage the list of 3rd-party libraries (including version numbers) which the teams are allowed to use.
  • A huge challenge is testing. Once you have managed to deploy all delivered artifacts to a single instance (and combining these artifacts into deployable content packages often poses its own set of problems), how do you test and where do you report issues? How does the triaging process work which assigns the issues to the individual teams for fixing? This can very easily create a culture of blaming and denying, which makes the actual bug fixing very hard.
  • The same goes for production problems. No tenant, and therefore no development team, wants to get blamed for bringing down the platform because of some issue, so each problem can get very political, and teams start to argue why they are not responsible.
  • And many more…
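
One of these cross-team checks is fairly cheap to build. The following is a rough sketch which scans several module directories and reports /apps paths defined by more than one team; the layout (the standard content-package path src/main/content/jcr_root in each tenant’s Maven module) and the module names are assumptions, not a fixed convention:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Sketch of a build-time check: walk the jcr_root folders of all tenant modules
// (passed as arguments, e.g. "tenant-a tenant-b") and report /apps paths that
// occur in more than one module.
public class DuplicatePathCheck {
    public static void main(String[] args) throws IOException {
        Map<String, List<String>> owners = new TreeMap<>();
        for (String module : args) {
            Path root = Paths.get(module, "src/main/content/jcr_root/apps");
            if (!Files.isDirectory(root)) continue;
            try (Stream<Path> paths = Files.walk(root)) {
                paths.filter(Files::isRegularFile)
                     .map(p -> "/apps/" + root.relativize(p).toString().replace('\\', '/'))
                     .forEach(p -> owners.computeIfAbsent(p, k -> new ArrayList<>()).add(module));
            }
        }
        owners.forEach((path, modules) -> {
            if (modules.size() > 1) {
                System.out.println("DUPLICATE " + path + " defined in " + modules);
            }
        });
    }
}
```

Hooked into a central build job, such a check at least surfaces naming collisions before they reach a shared instance.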

These are real-world problems which hurt productivity.

My thoughts on how you can overcome (at least) some of these problems:

  • The platform owner should communicate openly with all tenants and involved development teams, and encourage them to adhere to a common development model.
  • The platform owner should provide clear rules for how each team is supposed to work and how they create and share their artifacts, as well as clear rules for coding and naming.
  • The platform owner should be in charge of a small team which supports all tenants and all development teams and helps to align requirements and integrate the different codebases. This team is also responsible for all 3rd-party library management and should have write access to the code repositories of all development teams.
  • Build and deployment are centralized as well.
  • Issue triaging is a cross-team effort.

This is all possible in a setup where the platform owner is not only responsible for running the platform, but is also allowed to exercise control over the deployment artifacts of the individual parties.

A side note: there is an architectural style called “microservices” which seems to be gaining traction at the moment. It claims to address the “many teams working on a single platform” problem as well. But the whole idea is based on splitting a monolithic application into small self-contained services, which does not really apply to this multi-tenancy problem, where every tenant wants to customize some aspects of a common system for itself. If you apply this approach to the multi-tenancy problem here, you end up with a multi-platform architecture where each tenant has its own version of the platform.

Connecting dispatchers and publishers

Today I want to cover a question which comes up every now and then (my gut feeling says this question has come up at least once every quarter for the last 5 years …):

How should I connect my dispatchers with the publish instances? 1:1, 1:n or m:n?

To give you an impression of what these scenarios can look like, I have sketched both the 1:1 and the n:m scenario.

[Figure: the 1:1 setup, where each dispatcher is connected to exactly one publish instance; for cache invalidation, each publish instance is also connected only to its assigned dispatcher instance.]

[Figure: the n:m setup, where n dispatchers connect to m publish instances (illustrated here with n=3 and m=3); each dispatcher is connected via a load balancer to every publish instance, but each publish instance needs to invalidate all dispatcher caches.]

I want to give you my personal opinion and answer. You might get other answers, both from Adobe consultants and from specialists outside of Adobe. They all offer valuable insights into the question of how it’s done best in your case. Because it’s your case which matters.
My general answer to this question is: Use a 1:1 connection for these reasons:
  • it’s easy to debug
  • it’s easy to monitor
  • it does not require any additional hardware or configuration
From a high-availability point of view this approach seems to have a huge drawback: when either the dispatcher or the publish instance fails, the other part becomes unavailable as well.
Before we discuss this, let me state some facts which I consider basic and foundational to all my arguments here:
  • The dispatcher and the web server (I can only speak for Apache HTTPD and its derivatives, sorry IIS!) are incredibly stable. In the last 9 years I’ve set up and operated a good number of web environments, and I’ve never seen a crashing web server or a crashing dispatcher module. As long as no one stops the process, this beast keeps handling requests.
  • A web server (and the dispatcher) is capable of delivering thousands of requests per second if the files originate from local disk and just need to be delivered. That’s at least 10 times the number any publish instance can handle.
  • If you look for the bottleneck in handling HTTP requests in your AEM architecture, it’s always the publish application layer. That is exactly why there is a caching layer (the dispatcher) in front of it.
  • My assumption is that a web server on modern hardware (and operating systems) can deliver static files with a bandwidth of more than 500 Mbit per second (in a mixed-file scenario). So in most cases you reach the limit of your internet connection before you reach the limit of your web servers. Please note that this number is just a rough guess and depends on many other factors (see the back-of-the-envelope sketch after this list).
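
To illustrate why the web server itself is rarely the limiting factor, here is a back-of-the-envelope calculation; both the bandwidth and the average file size are assumed numbers, not measurements:

```java
// Rough back-of-the-envelope check (assumed numbers): how many requests per second
// does a 500 Mbit/s link sustain at an average cached-file size of 50 KB?
public class BandwidthEstimate {
    public static void main(String[] args) {
        double bandwidthBitsPerSec = 500_000_000d;   // 500 Mbit/s, assumed link capacity
        double avgFileSizeBytes = 50_000d;           // 50 KB, assumed mix of HTML/CSS/JS/images
        double requestsPerSec = bandwidthBitsPerSec / (avgFileSizeBytes * 8);
        System.out.printf("~%.0f requests/s before the link saturates%n", requestsPerSec);
    }
}
```

Under these assumptions, roughly 1,250 cached responses per second already saturate such a link, which is far more than a single publish instance will render.
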
Based on these assumptions, let’s consider these scenarios in a 1:1 setup:
  • When the publish instance fails, the dispatcher instance isn’t fully operational anymore, as it can no longer reach its renderer instance; so it’s best to take it out of the load balancing pool.
    So does this have any effect on the performance capabilities of your architecture? Of course it does: it reduces your ability to deliver static files from the dispatcher cache. We could avoid that if the dispatcher were connected to other publish instances as well. But as stated above, the delivery performance of static files isn’t the bottleneck at all, so taking out one web server doesn’t have any visible effect.
  • When a web server/dispatcher fails, the connected publish instance is not reachable anymore either, effectively reducing the capacity of your bottleneck even further.
    Admittedly, that’s true; but as stated above, I’ve rarely seen a crashed web server, so this case mostly occurs with hardware problems or massive misconfiguration.
So you have a measurable impact only when web server hardware goes down; in all other cases it’s not a problem for performance.
This is a small drawback, but from my point of view the benefits stated above outweigh it by far.
This is my standard answer when there’s no more specific information available. It’s a good rule of thumb. But if you have more specific requirements, it might make sense to deviate from the 1:1 rule.
For example:
  • You plan to have 20 publish instances. Then it doesn’t make sense to have 20 web server/dispatcher instances as well.
  • You want to serve a huge amount of static data (e.g. 100 TB of static assets), so your n copies of the same file get expensive in terms of disk space.
If you choose a different approach than the 1:1 scenario described in this blog post, please keep these factors in mind:
  • How do you plan to invalidate the dispatcher caches? Which publish instance will invalidate which dispatcher cache?
  • How do you plan to do maintenance of the publish instances?
  • What’s the effort to add or remove a publish instance? What needs to be changed?
Before you spend a lot of time and effort on building a complex dispatcher setup, please consider whether a CDN isn’t a more appropriate solution to your problem …