The problems of multi-tenancy: governance

A recurring topic in AEM projects is multi-tenancy. Wikipedia describes multitenancy as a “[…] software architecture in which a single instance of a software […] serves multiple tenants”. In the AEM projects I’ve done, I encountered this pattern most often when a company wants to host several brands and/or subsidiaries as independent tenants within a single AEM platform (that means: connected authoring and publishing instances). In this blog post I only cover the aspect of multi-tenancy within a single company. Hosting tenants for multiple independent companies is a different story and likely even more complex.

At first sight multi-tenancy seems to be only a technical problem (separation of content/templates/components, privileges, etc.), but from what I have learned, there is a much bigger problem which you should solve first: the aspect of organization and governance.

Multi-tenancy is hard when different tenants (be it brand organizations or subsidiaries) need to integrate into the single platform. Each tenant has its own requirements (depending on its special needs), its own timelines, and its own budget. You have larger tenants and smaller tenants on your AEM platform. But this does not necessarily reflect the power of these tenants inside the company. It may even contradict it, when a smaller or less powerful organization or brand has such demands that it becomes the largest tenant on your AEM platform.

That means that there will be conflicts when it comes to defining scope, timeline and budget. The tenant which contributes more budget wants to have more influence on these 3 aspects than another tenant which spends a significantly smaller amount. But the smaller tenant might have needs which overrule this, for example a tradeshow for which some new features on the brand pages are absolutely required, while the other tenant (although more powerful within the organization) has requirements which only become important in the more distant future. How are these requirements prioritized?

These questions (and conflicts) are not new, they have existed for decades, if not centuries. But they have a huge impact on the platform owner. The platform owner wants to satisfy the needs of all the tenants, but is often faced with contradicting requirements; while on the technical side these can often be solved (more or less, just by throwing people and time at the problem), there are still things which are organizational issues in the first place, and which can only be solved on an organizational or political level. Then you have topics like:

  • How can you coordinate different timelines of different tenants, so you can satisfy all their needs?
  • Tenants want to have their own development teams or agencies. How can they work together and feed their results into a single platform without breaking it? Who’s responsible when the platform breaks down?
  • How do you do funding when one tenant contributes development work to the platform and other tenants benefit from this work as well? Do you invoice the tenants which benefit from another tenant’s development work?
  • What’s the role of the platform owner? Does the platform have its own budget or is the platform solely funded by the tenants? Is the platform owner able to reject feature requests from tenants and say “no”?
  • How should the platform owner react to contradicting requirements? Is splitting the single platform into multiple ones (with different codebases) something which is desirable?

There are a lot of questions like these, and they are very specific to the company and the platform. They can all be solved, but the company and the organization itself has to solve them, not the platform development team(s). Because otherwise the organizational quarrels will trickle down even to the developers (and as we all know: this kind of human being doesn’t really like that :-))

My ideal multi-tenancy project looks like this: a strong platform owner with a budget of its own. The tenants are pretty much the same size, and they each fund the platform, for the largest part, in roughly equal amounts. A steering committee (with participants from all tenants) decides on all the organizational topics, and does the same on the technical level if required. Requirements are consolidated on a project level and then implemented by a team which reports to the platform owner.

Yeah, I have to admit, I haven’t found that customer project yet 🙂 But in such a project, as a member of the development team, you don’t really feel the multi-tenancy aspect on an organizational level anymore, but only have to deal with it on a technical level. Which is very nice.

AEM Basics: Runmodes

Today I want to discuss a feature which is very basic and widely used: “runmodes”. You might already have encountered them when you deployed an authoring instance and a publishing instance. Basically both can be deployed from the very same installation package, but just because of a magic string at the right place during installation the behaviour changes dramatically: one instance becomes an authoring instance, the other a publishing instance. It’s because of the runmode you configured.

You can think of runmodes as labels or roles you attach to instances, and “author” and “publish” are just special ones. At runtime you can check for these labels and react accordingly (the SlingSettingsService is your friend here). A more sophisticated usecase is OSGI configuration: based on the location a config is placed at, this config might be active or not, depending on the runmodes (see the AEM docs on this topic).
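
As a minimal sketch (the component and method names are made up for illustration), such a runmode check could look like this:

import java.util.Set;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Reference;
import org.apache.sling.settings.SlingSettingsService;

@Component
public class RunmodeAwareService {

  @Reference
  private SlingSettingsService slingSettings;

  public boolean isAuthor() {
    // getRunModes() returns all runmodes attached to this instance
    Set<String> runmodes = slingSettings.getRunModes();
    return runmodes.contains("author");
  }
}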

But runmodes are not limited to “author” or “publish”; you can attach as many runmodes to an instance as you like. For example you can create labels indicating the development environments (for example “integration” or “preproduction”), and you can have special configurations for these environments. This makes it a lot easier if you want your application to behave differently on these environments than on production.

The best of all: when you use runmodes to differentiate your environments from each other, you can easily have the configurations for all environments in a single content package, and deploy this package to all environments, no matter if it’s the production or the integration environment. If the runmodes don’t match, a configuration just doesn’t become active.
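
For illustration (assuming a hypothetical application folder /apps/myproject): OSGI configurations below such folders are applied only when all runmodes in the folder name match the instance’s runmodes, the more runmodes in the name, the more specific the configuration:

/apps/myproject/config                        applies to all instances
/apps/myproject/config.author                 applies only to authoring instances
/apps/myproject/config.publish                applies only to publishing instances
/apps/myproject/config.publish.integration    applies only to publishing instances
                                              which also carry the "integration" runmode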


AEM coding best practice: Servlets

You might say servlets are old technology. So old that every Java web developer should know everything about them.

Yes, servlets have existed since the nineties of the last century (to be exact: since 1997), and the basics haven’t really changed. So what’s so special about servlets that I decided to write a dedicated blog post on them and title it “AEM coding best practice”?

Well, there’s nothing special in terms of coding. All the things which have been recommended since 1997 can still be considered valid. But there’s a subtle difference between developing servlets for AEM and developing servlets for other types of applications: AEM (or: Sling) is resource oriented.

This aspect makes it hard for developers who normally bind servlets to hardcoded paths (either via annotations or via web.xml bindings). Binding servlets to a path is still possible in Sling, but it is actually an anti-pattern, because then the servlet is not bound to a real existing resource, and therefore a number of goodies of Sling are not applicable.

Instead I recommend you to bind servlets to resource types. The first and probably most obvious benefit is that you do not need to hardcode any path in your code (or config); instead you can just move the resource type to the path where you would like it to be, and then the servlet can be called via this path. The second benefit is that you can apply access control on the JCR nodes backing the respective resource. If you don’t have read access on that resource, you cannot call the servlet. Which is a great way to restrict access to certain functions to a number of users without implementing access control in your own code, just by using the ootb features of the JCR repository!

So this “bind to a resource type” should remind you pretty much of the way resources and their components are wired. A resource has the property “resource type”, which denotes the component used to render this resource. With a servlet you can specify the resource type your servlet wants to handle. So it’s basically the same, and instead of JSPs or Sightly scripts you can also use servlets to implement components. You can also easily handle selectors or different extensions in servlets.
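
A minimal sketch of such a servlet, using the Felix SCR annotations common in AEM projects (the resource type, selector and extension are made up for illustration):

import java.io.IOException;
import javax.servlet.ServletException;
import org.apache.felix.scr.annotations.sling.SlingServlet;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;

@SlingServlet(
  resourceTypes = "myproject/components/contactform",  // bound to a resource type, not a path
  selectors = "data",
  extensions = "json",
  methods = "GET")
public class ContactFormServlet extends SlingSafeMethodsServlet {

  @Override
  protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
      throws ServletException, IOException {
    // the resource this servlet was called on is directly available,
    // including its access control: no read access means no servlet call
    response.getWriter().write(request.getResource().getPath());
  }
}

If a resource with this resource type lives at /content/mysite/contact, a request to /content/mysite/contact.data.json will then be handled by this servlet.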

I do not recommend to drop JSPs and Sightly altogether and switch to servlets, unless your frontend developers speak Java fluently, now and for the next years. Sightly has been developed just for this specific purpose: frontend stuff should be handled by frontend developers and must not require Java development knowhow. Use Sightly whenever possible.

And finally a bookmark for everyone working with Sling: The Sling servlets and scripts documentation.

AEM scaling patterns: Avoid shared sessions

The biggest change in AEM 6.0 compared to its prior versions is the use of Apache Oak as repository implementation instead of Apache Jackrabbit version 2.x; although both implement the JCR 2.0 API (Oak not completely yet, but the “important” parts are there), there are a number of differences between them.

In the area of scalability the most notable change is the use of MVCC (multi version concurrency control, a proven approach taken from the relational database world) in Oak. It decouples sessions from the global repository state and is the basis for the scalability of the repository. But it comes at the price that sessions should be used only by a single thread. It is only a “should”, because Oak detects any usage of a single session by multiple threads and then serializes the access to it.

(For the record: the same recommendation already applied to Apache Jackrabbit 2.x, but the impact was never that high, mostly because it wasn’t as scalable as Oak is now.)

This isn’t a real limitation, but it requires careful design of any application. In the context of AEM it normally isn’t a problem at all, because every incoming HTTP request uses a dedicated session of its own. While this is true for the request itself, there is often functionality which doesn’t follow this pattern.

I put a common example of this development pattern on GitHub, including a recommended implementation and a discouraged implementation. The problem in the discouraged example lies in the fact that the repository session (in the example hidden behind the resource resolver abstraction) is opened once at the startup of the service, by the thread which does the activation of all services. But then resources are handed out to every other thread calling the getConfiguration() method. If every request is doing this call, they all get synchronized here, thus limiting scalability.

In the recommended example this problem is mitigated in a way that each call to getConfiguration() opens a new session, reads the required resource and then closes the session. Here the session and its data are held completely inside a thread, and there’s no need for synchronization anymore.
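
A condensed sketch of this pattern (names and paths are illustrative; the full version is in the GitHub example, and a service user mapping for getServiceResourceResolver() is assumed to exist):

import java.util.HashMap;
import java.util.Map;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;

@Component
@Service(ConfigurationService.class)
public class ConfigurationService {

  @Reference
  private ResourceResolverFactory resolverFactory;

  public Map<String, Object> getConfiguration() throws LoginException {
    // open a fresh resolver (and thus session) per call, so the session
    // never leaves the calling thread
    ResourceResolver resolver = resolverFactory.getServiceResourceResolver(null);
    try {
      Resource config = resolver.getResource("/etc/myproject/configuration");
      // copy the properties, so nothing references the session after it is closed
      return new HashMap<String, Object>(config.getValueMap());
    } finally {
      resolver.close();
    }
  }
}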

That’s the theory part, but how can you easily detect if you have this problem as well? The easiest way is to set the logging for the class org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate to DEBUG. Every time Oak detects that a session is used by multiple threads, it prints a stack trace to the log. If this happens on write access, it uses the WARN level, in case of reads the DEBUG level.

23.02.2015 09:21:56.916 *WARN* [0:0:0:0:0:0:0:1 [1424679716845] GET /content/geometrixx/en/services.html HTTP/1.0] org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate Attempt to perform hasProperty while another thread is concurrently reading from session-494. Blocking until the other thread is finished using this session. Please review your code to avoid concurrent use of a session.
java.lang.Exception: Stack trace of concurrent access to session-494
at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:276)
at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:113)
at org.apache.jackrabbit.oak.jcr.session.NodeImpl.hasProperty(NodeImpl.java:812)
at org.apache.sling.jcr.resource.JcrPropertyMap.read(JcrPropertyMap.java:350)
...
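
To switch this logging on, you can create a dedicated logger for this class; a sketch of such an OSGi factory configuration for the Sling logging module (the log file name is an example), e.g. in a file named org.apache.sling.commons.log.LogManager.factory.config-oak-sessions.config:

org.apache.sling.commons.log.level="debug"
org.apache.sling.commons.log.file="logs/oak-sessions.log"
org.apache.sling.commons.log.names=["org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate"]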

If you want to have a scalable AEM application, you should carefully watch out for these log messages and optimize the use of shared sessions.

Meta: I am at Summit 2015 in Salt Lake City

It’s always hard as a techie to get to a customer conference (especially to a high-profile one); even as an Adobe employee it is hard to justify a trip to the Adobe Summit. But with the support of some co-workers I made it. I am very proud to present at the Adobe Summit this year in Salt Lake City.

I will have a (hands-on) lab session called “Silver bullets for Adobe Experience Manager success”. It will cover some aspects of how you can use existing and well-known features of the AEM technology stack to get the most out of AEM. If you are an AEM expert I won’t tell you any news, but I may give you some inspiration and ideas. But don’t be too late with registration, the 3 slots I have are filling quickly.

I am really looking forward to it and I hope to meet many of my readers there. Just drop me a note if you want to meet with me in person.

Thanks,
Jörg

AEM anti-pattern: The hardcoded content structure

One of the first things I usually do when we start an AEM project is to get a clear vision of the content and its structure. We normally draw a number of graphs, discuss a number of use cases, and in the end we come up with a content structure which satisfies the requirements. Then we implement this structure as a hierarchy of nodes, and that’s it.

In many cases developers start to use this structure without much thinking. They assume that the node structure is always like this. They even start to hardcode paths and language names or mimic this structure. Sometimes that’s not a problem. But it gets hard when you are building a multi-language or multi-tenant site and you start simple, with only 1 language and 1 tenant; then you might end up with these languages or tenants being hardcoded, as “there was no time to make it right”. Imagine what happens when you add the second language or the second site and someone has hardcoded a language or a site name/path.

So, what can you do to avoid hardcoded paths? Some information is always stored at certain places. For example you can store basic contact information on the root node, which you can reuse on the whole site. But how do you identify the correct root node if you have multiple sites? Or how do you identify the language of the site?

The easiest way is to mark these site root pages (I prefer pages here over plain nodes, as they can be created using the authoring UI and are much easier to author) with a certain property and value; ideally you have a special template with its own dedicated resource type. Then you can identify these root pages using 2 approaches:

  • When you need to find them all, use a JCR query and look for all pages with this specific resource type.
  • When you need to find the siteroot page for a given page (or resource), just iterate up the hierarchy until you find a page with this resource type (see the sketch below).
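
A sketch of this second approach (the resource type is a made-up example; com.day.cq.wcm.api.Page comes with AEM):

import com.day.cq.wcm.api.Page;

public class SiteRootUtil {

  private static final String SITEROOT_RT = "myproject/components/structure/siteroot";

  /**
   * Walks up the page tree until a page with the site root
   * resource type is found; returns null if there is none.
   */
  public static Page findSiteRoot(Page page) {
    Page current = page;
    while (current != null) {
      if (current.getContentResource() != null
          && current.getContentResource().isResourceType(SITEROOT_RT)) {
        return current;
      }
      current = current.getParent();
    }
    return null;
  }
}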

This mechanism allows you to be very flexible in terms of the content hierarchy. You no longer depend on pages being on a certain level or having special names. It’s all dynamic and you don’t have any dependency on the content structure. This page doesn’t even have to be the root page of the public-facing site, but can be just a configuration page used for administration and configuration purposes. The real root page can be a child or grand-child of it. You have lots of choices then.

But wait, there is a single limitation: every site must have a site root page using this special template/resource type. But that isn’t a hard restriction, is it?

And remember: Never do string operations on a content path to determine something, neither the language nor the site name. Never.

Connecting dispatchers and publishers

Today I want to cover a question which comes up every now and then (my gut feeling says this question appeared at least once every quarter for the last 5 years …):

How should I connect my dispatchers with the publish instances? 1:1, 1:n or m:n?

To give you an impression of how these scenarios could look, I graphed the 1:1 and the n:m scenario.

[Image: publish-dispatcher-connections-1-1-final]
The 1:1 setup, where each dispatcher is connected to exactly 1 publish instance; for the invalidation every publish instance is also connected only with its assigned dispatcher instance.
[Image: publish-dispatcher-connections-N-M-final]
The n:m setup, where n dispatchers connect to m publish instances (for illustration here with n=3 and m=3); each dispatcher is connected via loadbalancer to each publish instance, but each publish instance needs to invalidate all dispatcher caches.

I want to give you my personal opinion and answer to it. You might get other answers, both from Adobe consultants and from other specialists outside of Adobe. They are all valuable insights into the question of how it’s done best in your case. Because it’s your case which matters.

My general answer to this question is: use a 1:1 connection, for these reasons:
  • it’s easy to debug
  • it’s easy to monitor
  • it does not require any additional hardware or configuration
From a high-availability point of view this approach seems to have a huge drawback: when either the dispatcher or the publish instance fails, the other part is not available as well.

Before we discuss this, let me state some facts which I consider basic and foundational to all my arguments here:
  • The dispatcher and the web server (I can only speak for Apache HTTPD and its derivatives, sorry IIS!) are incredibly stable. In the last 9 years I’ve set up and operated a good number of web environments, and I’ve never seen a crashing web server nor a crashing dispatcher module. As long as no one stops the process, this beast is handling requests.
  • A webserver (and the dispatcher) is capable of delivering thousands of requests per second, if these files originate from the local disks and just need to be delivered. That’s at least 10 times the number any publish instance can handle.
  • If you look for the bottleneck in handling HTTP requests in your AEM architecture, it’s always the publish application layer. Which is exactly the reason why there is a caching layer (the dispatcher) in front of it.
  • My assumption is that a web server on modern hardware (and operating systems) is able to deliver static files with a bandwidth of more than 500 MBit per second (in a mixed-file scenario). So in most cases, before you reach the limit of your web servers, you reach the limit of your internet connection. Please note that this number is just a rough guess (depending on many other factors).
Based on these assumptions, let’s consider these scenarios in a 1:1 setup:
  • When the publish instance fails, the dispatcher instance isn’t fully operational anymore, as it cannot reach its renderer instance; so it’s best to take it out of the load balancing pool.
    Does this have any effect on the performance capabilities of your architecture? Of course it has: it reduces your ability to deliver static files from the dispatcher cache. We could avoid this if we had the dispatcher connected to other publish instances as well. But as stated above, the delivery performance of static files isn’t a bottleneck at all, so when you take out 1 web server, you don’t see any effect.
  • A webserver/dispatcher fails, and the connected publish instance is not reachable anymore, effectively reducing the power of your bottleneck even more.
    Admitted, that’s true; but as stated above, I’ve rarely seen a crashed web server; so this case mostly occurs on hardware problems or massive misconfigurations.
So you have a measurable impact only in the case that a web server’s hardware goes down; in all other cases it’s not a problem for the performance.

This is a small drawback, but from my point of view the other benefits stated above outweigh it by far.

This is my standard answer when there’s no more specific information available. It’s a good rule of thumb. But if you have more specific requirements, it might make sense to change the 1:1 rule to a different one.
For example:
  • You plan to have 20 publish instances. Then it doesn’t make sense to have 20 webservers/dispatchers as well.
  • You want to serve a huge amount of static data (e.g. 100 TB of static assets), so your n copies of the same file get expensive in terms of disk space.
If you choose a different approach than the 1:1 scenario described in this blog post, please keep these factors in mind:
  • How do you plan to invalidate the dispatcher caches? Which publish instance will invalidate which dispatcher cache?
  • How do you plan to do maintenance of the publish instances?
  • What’s the effort to add or remove a new publish instance? What needs to be changed?
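
Regarding the invalidation: in the 1:1 setup this stays trivial, because each publish instance flushes exactly one cache; on the dispatcher side this can be pinned down in the /allowedClients section of dispatcher.any. A sketch (the docroot and the IP address of the paired publish instance are placeholders):

/cache
  {
  /docroot "/var/www/html"
  /allowedClients
    {
    # deny flush requests from everywhere ...
    /0000 { /glob "*" /type "deny" }
    # ... except from the single paired publish instance
    /0001 { /glob "10.0.0.12" /type "allow" }
    }
  }
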
Before you plan to put a lot of time and effort into building a complex dispatcher scenario, please consider whether a CDN isn’t a more appropriate solution to your problem…

Writing health checks — the problem

I started my professional career in IT operations at a large automotive company, where I supported the company’s brand websites. There I learned the importance of a good monitoring system which supports IT operations in detecting problems early and accurately. And I also learned that even enterprise IT monitoring systems are best fed with a dead-simple HTML page containing the string “OK”. Or some other string, which then means “Oh, something’s wrong!”.

In this post I want to give you some impression of the problematics of application monitoring, especially with Sling health checks in mind. Because it isn’t as simple as it sounds in the first place, and you can do things wrong. But every application should possess the ability to deliver some information about its current status, just as the cockpit in your car gives you information about the available gas (or electricity) in your system.

The problem of async error reporting

A health check is executed when its report is requested, so you cannot just push your error information to the health check as you log it during processing. Instead you have to write it to a queue (or any other data structure), where this information is stored until it is consumed.
The situation is different for periodical daily jobs, where only 1 result is produced every day.

Consolidating information

When you have many individual data points, but need to build a single data point for a certain timeframe (say 10 seconds), you need to come up with a strategy to consolidate them. A common approach is to collect all individual results (e.g. just single “ok” or “not ok” information) in a list. When the health check status needs to be calculated, this list is iterated, the number of “ok”s and “not ok”s is counted, and the ratio is calculated and reported; after that the list is cleared, and the process starts again.
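
A sketch of this approach, based on the Sling health check API (how the application code obtains this instance to report its results is left out, and the 2% threshold is just an example):

import java.util.ArrayList;
import java.util.List;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.hc.api.HealthCheck;
import org.apache.sling.hc.api.Result;

@Component
@Service(HealthCheck.class)
@Property(name = HealthCheck.NAME, value = "transaction-errors")
public class TransactionHealthCheck implements HealthCheck {

  private final List<Boolean> results = new ArrayList<Boolean>();

  // called by the application code whenever a transaction finishes
  public void report(boolean ok) {
    synchronized (results) {
      results.add(ok);
    }
  }

  @Override
  public Result execute() {
    int failed = 0;
    int total;
    synchronized (results) {
      total = results.size();
      for (Boolean ok : results) {
        if (!ok) {
          failed++;
        }
      }
      results.clear(); // start a fresh consolidation cycle
    }
    if (total > 0 && (failed * 100.0 / total) > 2.0) {
      return new Result(Result.Status.CRITICAL, failed + " of " + total + " transactions failed");
    }
    return new Result(Result.Status.OK, "failure rate within limits");
  }
}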

When you design such consolidation algorithms, you should always keep in mind how errors are reported. In the above-mentioned case, 10 seconds full of errors would be reported as CRITICAL for only a single reporting cycle. The cycle before and after could be OK again. Or if you have larger cycles (e.g. 5 minutes for your Nagios), think about how 10 seconds of errors are reported while in the remaining 4’50’’ you don’t have any problem at all. Should it be reported with the same result as when you have the same number of errors spread over these 5 minutes? How should this case be handled if you have decided to ignore an average rate of 2% failing transactions?

You see that you can spend a lot of thinking on it. But be assured: do not try to be too sophisticated. Just take a simple approach and implement it. Then you’re better than 80% of all projects: because you have actually reasoned about this problem and decided to write a health check!

About JCR queries

In the past days 2 interesting blog posts have been written about the use of JCR queries: Dan Klco’s “9 JCR-SQL2 queries every AEM developer should know” and “CQ Queries demystified” by @ItGumby.

Well, if you have already read my older articles about JCR query (part 1 and part 2), you might get the impression that I am not a big fan of JCR queries. There might be situations where that’s totally true.

When you come from a SQL world, queries are the only way to retrieve data; therefore many developers tend to use queries without ever thinking about the other way offered by JCR: the “walk the tree” approach.
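
To make both approaches concrete, here is a minimal sketch of each (the path is made up):

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;
import org.apache.sling.api.resource.Resource;

public class LookupExamples {

  // the query approach: let the index find all matching nodes
  public QueryResult byQuery(Session session) throws RepositoryException {
    QueryManager qm = session.getWorkspace().getQueryManager();
    Query query = qm.createQuery(
        "SELECT * FROM [cq:Page] AS p WHERE ISDESCENDANTNODE(p, '/content/mysite')",
        Query.JCR_SQL2);
    return query.execute();
  }

  // the "walk the tree" approach: recurse over a known, small subtree
  public void byTreeWalk(Resource resource) {
    for (Resource child : resource.getChildren()) {
      // inspect the child here
      byTreeWalk(child);
    }
  }
}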

@ItGumby gives 2 reasons why one should use JCR query: efficiency and flexibility in structure. First, efficiency depends on many factors. In my second post I try to explain which kinds of queries are fast, and which ones aren’t that fast, just because of the way the underlying index (even with AEM 6.0 it’s still Lucene in 99.9% of the cases) is working. With the custom indexes in AEM6 we might have a game changer here.
Regarding flexibility: yes, that’s a good reason. But there are cases where you have a specific structure and are looking for hits only in a small area of the tree; there walking the tree is the better choice. If you need to search the complete tree, however, a query can be faster.

Dan gives a number of good examples for JCR queries. And I wholeheartedly admit that the number of JCR SQL examples on the net is way too low. The JCR specification is quite readable for a large part, but I was never really good at implementing code when I only had the formal description of a language’s syntax. So a big applause to Dan!
But please allow me the recommendation to test every query first on production content (not necessarily on your production system!), just to find out the timing and the number of results. I have already experienced cases where an implementation was fast on development but painfully slow on production, just because of this tiny aspect.

Managing multiple instances of services: OSGI service factories

And now the third post in my little series of OSGI related postings. I already showed how you can easily manage a number of services implementing the same interface using a service tracker or by using the right SCR annotations.

Sometimes you need to implement services which just differ by configuration. A nice example of this is logging, where you want to have the possibility of multiple logging facilities logging to different log files at different levels. Somebody (normally the admin of the system) is then able to leverage this and configure the logging as she likes.

So, more formally spoken, you have zero to N instances of the same service, just with different configurations. Just duplicating the code and creating a logger1 with configurable values, a logger2, a logger3 and so on doesn’t make sense, as it’s just code duplication and inflexible (what happens when you need logger100?).

For this kind of problem OSGI offers the concept of factory configurations: as the name already says, you create service instances just by adding configurations.

As an example let’s assume that you need to send out emails via a configurable number of SMTP servers, because for internal emails you need to use a different mailserver than for external users or partners. We will implement this as a service, and on the service we will configure the details of the SMTP server we intend to use.

So let’s start with the service interface:

public interface MailService {
  public void sendMail (String from, String to, String body);
}

And a dummy implementation could look like this; we want to focus only on the properties, and not really on the details of how to send emails 🙂

import org.apache.felix.scr.annotations.Activate;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.commons.osgi.PropertiesUtil;
import org.osgi.service.component.ComponentContext;

@Service
@Component(metatype=true,label="simple mailservice", description="simple mailservice")
public class MailServiceImpl implements MailService {

  private static final String DEFAULT_ADDRESS = "localhost:25";
  @Property(description="address of the SMTP server including port", value=DEFAULT_ADDRESS)
  private static final String ADDRESS = "mailservice.address";
  private String address;

  private static final String DEFAULT_USERNAME = "admin";
  @Property(description="username to login to the SMTP server", value=DEFAULT_USERNAME)
  private static final String USERNAME = "mailservice.username";
  private String username;

  private static final String DEFAULT_PASSWORD = "admin";
  @Property(description="password to login to the SMTP server", value=DEFAULT_PASSWORD)
  private static final String PASSWORD = "mailservice.password";
  private String password;

  @Activate
  protected void activate (ComponentContext ctx) {
    address = PropertiesUtil.toString(ctx.getProperties().get(ADDRESS), DEFAULT_ADDRESS);
    username = PropertiesUtil.toString(ctx.getProperties().get(USERNAME), DEFAULT_USERNAME);
    password = PropertiesUtil.toString(ctx.getProperties().get(PASSWORD), DEFAULT_PASSWORD);
  }

  public void sendMail(String from, String to, String body) {
    // login to the smtp server using address, username and password provided via OSGI
    // properties and send the email
  }
}

But how can you extend this and make sure that you can create a configuration not just for 1 mailserver, but for multiple ones? Easy, just add the property “configurationFactory=true” to the @Component annotation:

@Component(metatype=true,label="simple mailservice", description="simple mailservice",
  configurationFactory=true)

If you compile and deploy your service now, you can see in the Apache Felix Configuration Manager that you have a “plus” sign in front of the service; when you click it, you get a new instance of your service, which you can configure.

[Image: ConfigurationFactory in the Felix Console]

But when you have 3 mail services configured, which one do you get when you write something like this:

@Reference
MailService mailService;

The answer: it’s not deterministic. You might get any of the configured mailservices. If you need a specific one, the easiest way is to give them labels to make them distinguishable. So add another property to your service:

@Property(description="Label for this SMTP service")
private static final String NAME = "mailservice.label";

And when you then create a configuration with the label “INTERNET”, you can reference exactly this service instance with this kind of reference (using the target filter):

@Reference(target = "(mailservice.label=INTERNET)")
MailService mailService;

This will resolve correctly when you have a mailservice configured with the label “INTERNET”. As long as you don’t have such a service, any service containing such a reference won’t start (unless you create a dynamic reference …)

If you want to be more flexible and also implement some more logic in the lookup process (e.g. having a default mailservice or supporting a number of INTERNET mail services), you can use the whiteboard pattern to track all available MailService instances; based on their labels you can implement any logic you need (see the sketch below).
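
A sketch of such a whiteboard-style lookup with SCR (the class name and the fallback logic are made up; the bind/unbind methods track all available MailService instances by their label):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.ReferenceCardinality;
import org.apache.felix.scr.annotations.ReferencePolicy;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.commons.osgi.PropertiesUtil;

@Component
@Service(MailDispatcher.class)
@Reference(name = "mailService", referenceInterface = MailService.class,
  cardinality = ReferenceCardinality.OPTIONAL_MULTIPLE,
  policy = ReferencePolicy.DYNAMIC,
  bind = "bindMailService", unbind = "unbindMailService")
public class MailDispatcher {

  private final Map<String, MailService> services = new ConcurrentHashMap<String, MailService>();

  protected void bindMailService(MailService service, Map<String, Object> props) {
    services.put(PropertiesUtil.toString(props.get("mailservice.label"), "default"), service);
  }

  protected void unbindMailService(MailService service, Map<String, Object> props) {
    services.remove(PropertiesUtil.toString(props.get("mailservice.label"), "default"));
  }

  public MailService getMailService(String label) {
    // fall back to the default service if no labelled one is present
    MailService service = services.get(label);
    return service != null ? service : services.get("default");
  }
}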

As you can see, OSGI is quite powerful when it comes to looking up and connecting services. Combined with the power of SCR you can easily create a lot of configurable services with very little effort. Managing these services and doing proper lookups is also just a few lines of code away.

Personally I really like the possibilities I have with the OSGI container inside of AEM, it gives me the flexibility to access lots of different parts of the system. And creating and injecting my own services is easier than ever.