Meta: I am at Summit 2015 in Salt Lake City

It’s always hard as a techie to get to a customer conference (especially to a high-profile one); even as an Adobe employee it is hard to justify attending the Adobe Summit. But with the support of some co-workers I made it. I am very proud to present at the Adobe Summit this year in Salt Lake City.

I will have a (hands-on) lab session called “Silver bullets for Adobe Experience Manager success“. It will cover some aspects of how you can use existing and well-known features of the AEM technology stack to get the most out of AEM. If you are an AEM expert I won’t tell you anything new, but maybe I can give you some inspiration and ideas. But don’t be too late with your registration, the 3 slots I have are filling quickly.

I am really looking forward to it and I hope to meet many of my readers there. Just drop me a note if you want to meet with me in person.

Thanks,
Jörg

AEM anti-pattern: The hardcoded content structure

One of the first things I usually do when we start an AEM project is to get a clear vision of the content and its structure. We normally draw a number of graphs, discuss a number of use cases, and in the end we come up with a content structure which satisfies the requirements. Then we implement this structure as a hierarchy of nodes and that’s it.

In many cases developers start to use this structure without much thinking. They assume that the node structure is always like this. They even start to hardcode paths and language names or mimic this structure. Sometimes that’s not a problem. But it gets hard when you are building a multi-language or multi-tenant site and you start simple with only 1 language and 1 tenant; then you might end up with these languages or tenants being hardcoded, as “there was no time to make it right”. Imagine what happens when you start with the second language or the second site and someone hardcoded a language or a site name/path.

So, what can you do to avoid hardcoded paths? Some information is always stored in certain areas. For example you can store basic contact information on the root node, which you can reuse on the whole site. So how do you identify the correct root node if you have multiple sites? Or how do you identify the language of the site?

The easiest way is to mark these site root pages (I prefer pages here over nodes, as they can be created using the authoring UI and are much easier to author) with a certain property and value. The easiest way is to have a special template with its own dedicated resource type. Then you can identify these root pages using 2 approaches:

  • When you need to find them all, use a JCR query and look for all pages with this specific resource type.
  • When you need to find the site root page for a given page (or resource), just iterate up the hierarchy until you find a page with this resource type.
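A minimal sketch of the second approach, using the Page API (the resource type "myproject/components/structure/siteroot" and the method name are of course just made up for illustration):

// walk up the page hierarchy until a page with the site root resource type is found
private Page findSiteRoot(Page page) {
  Page current = page;
  while (current != null) {
    Resource content = current.getContentResource();
    if (content != null && content.isResourceType("myproject/components/structure/siteroot")) {
      return current;
    }
    current = current.getParent();
  }
  return null; // no site root found above this page
}

Finding all site roots (the first approach) is then just a JCR query for pages with this resource type.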

This mechanism allows you to be very flexible in terms of the content hierarchy. You no longer depend on pages being on a certain level or having special names. It’s all dynamic and you don’t have any dependency on the content structure. This page doesn’t even have to be the root page of the public-facing site, but can be just a configuration page used for administration and configuration purposes. The real root page can be a child or grand-child of it. You have lots of choices then.

But wait, there is a single limitation: Every site must have a site root page using this special template/resource type. But that isn’t a hard restriction, is it?

And remember: Never do string operations on a content path to determine something, neither the language nor the site name. Never.

Connecting dispatchers and publishers

Today I want to cover a question which comes up every now and then (my gut feeling says this question appeared at least once every quarter for the last 5 years …):

How should I connect my dispatchers with the publish instances? 1:1, 1:n or m:n?

To give you an impression of how these scenarios could look, I graphed the 1:1 and the n:m scenario.

[Image: publish-dispatcher-connections-1-1-final]
The 1:1 setup, where each dispatcher is connected to exactly 1 publish instance; for the invalidation each publish instance is also connected only to its assigned dispatcher instance.
[Image: publish-dispatcher-connections-N-M-final]
The n:m setup, where n dispatchers connect to m publish instances (for illustration here with n=3 and m=3); each dispatcher is connected via loadbalancer to each publish instance, but each publish instance needs to invalidate all dispatcher caches.
I want to give you my personal opinion and answer to it. You might get other answers, both from Adobe consultants and other specialists outside of Adobe. They all offer valuable insights into the question of how it’s done best in your case. Because it’s your case which matters.
My general answer to this question is: Use a 1:1 connection for these reasons:
  • it’s easy to debug
  • it’s easy to monitor
  • it does not require any additional hardware or configuration
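For illustration, in such a 1:1 setup the /renders section of the dispatcher configuration (dispatcher.any) contains exactly one entry; hostname and port are of course just placeholders:

/renders
  {
  # the single publish instance this dispatcher is connected to
  /publish1
    {
    /hostname "publish1.example.com"
    /port "4503"
    }
  }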
From a high-availability point of view this approach seems to have a huge drawback: When either the dispatcher or the publish instance fails, the other part is not available either.
Before we discuss this, let me state some facts which I consider basic and the foundation of all my arguments here:
  • The dispatcher and the web server (I can only speak for Apache HTTPD and its derivatives, sorry IIS!) are incredibly stable. In the last 9 years I’ve set up and operated a good number of web environments and I’ve never seen a crashing web server nor a crashing dispatcher module. As long as no one stops the process, this beast keeps handling requests.
  • A webserver (and the dispatcher) is capable of delivering thousands of requests per second, if these files originate from the local disks and just need to be delivered. That’s at least 10 times the number any publish instance can handle.
  • If you look for the bottleneck in handling HTTP requests in your AEM architecture it’s always the publish application layer. Which is exactly the reason why there is a caching layer (the dispatcher) in front of it.
  • My assumption is that a web server on modern hardware (and operating systems) is able to deliver static files with a bandwidth of more than 500 mbit per second (in a mixed file scenario). So in most cases, before you reach the limit of your web servers, you reach the limit of your internet connection. Please note that this number is just a rough guess (depending on many other factors).
Based on these assumptions, let’s consider these scenarios in a 1:1 setup:
  • When the publish instance fails, the dispatcher instance isn’t fully operational anymore, as it can no longer reach its renderer instance; so it’s best to take it out of the load balancing pool.
    So does this have any effect on the performance capabilities of your architecture? Of course it has: it reduces your ability to deliver static files from the dispatcher cache. Which we could avoid if we had the dispatcher connected to other publish instances as well. But as stated above, the delivery performance of static files isn’t a bottleneck at all, so when you take out 1 web server you don’t see any effect.
  • A webserver/dispatcher fails, and the connected publish instance is not reachable anymore, effectively reducing the capacity of your bottleneck even more.
    Admittedly, that’s true; but as stated above, I’ve rarely seen a crashed web server; so this case mostly occurs with hardware problems or massive misconfigurations.
So, you have a measurable impact only in the case that web server hardware goes down; in all other cases it’s not a problem for the performance.
This is a small drawback, but from my point of view the other benefits stated above outweigh it by far.
This is my standard answer when there’s no more specific information available. It’s a good rule of thumb. But if you have more specific requirements, it might make sense to change the 1:1 rule to a different one.
For example:
  • You plan to have 20 publish instances. Then it doesn’t make sense to have 20 webserver/dispatchers as well.
  • You want to serve a huge amount of static data (e.g. 100 TB of static assets), so your n copies of the same file get expensive in terms of disk space.
If you choose a different approach than the 1:1 scenario described in this blog post, please keep these factors in mind:
  • How do you plan to invalidate the dispatcher caches? Which publish instance will invalidate which dispatcher cache?
  • How do you plan to do maintenance of the publish instances?
  • What’s the effort to add or remove a new publish instance? What needs to be changed?
Before you plan to spend a lot of time and effort into building a complex dispatcher scenario, please think if a CDN isn’t a more appropriate solution to your problem…

Writing health checks — the problem

I started my professional career in IT operation at a large automotive company, where I supported the company’s brand websites. There I learned the importance of a good monitoring system, which supports IT operations in detecting problems early and accurately. And I also learned that even enterprise IT monitoring systems are fed best with a dead-simple HTML page containing the string “OK”.  Or some other string, which then means “Oh, something’s wrong!”.

In this post I want to give you some impression of the problems of application monitoring, especially with Sling health checks in mind. Because it isn’t as simple as it sounds at first, and you can do things wrong. But every application should possess the ability to deliver some information about its current status, just as the cockpit in your car gives you information about the available gas (or electricity) in your system.

The problem of async error reporting

The health check is executed when the reports are requested, so you cannot just push your error information to the health check as you log it during processing. Instead you have to write it to a queue (or any other data structure), where this information is stored until it is consumed.
The situation is different for periodic jobs such as daily jobs, where only 1 result is produced every day.

Consolidating information

When you have many individual data points, but you need to build a single data point for a certain timeframe (say 10 seconds), you need to come up with a strategy to consolidate them. A common approach is to collect all individual results (e.g. just single “ok” or “not ok” information) and add them to a list. When the health check status needs to be calculated, this list is iterated, the number of “OKs” and “not OKs” is counted, and the ratio is calculated and reported; after that the list is cleared, and the process starts again.
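As a minimal sketch of this approach (using the Sling Health Check API from org.apache.sling.hc.api; the queue, the reportTransaction method and the 2% threshold are just made up for illustration):

@Service
@Component
@Property(name = HealthCheck.NAME, value = "transaction errors")
public class TransactionErrorHealthCheck implements HealthCheck {

  // individual "ok" / "not ok" results are collected here by the processing code
  private final Queue<Boolean> results = new ConcurrentLinkedQueue<Boolean>();

  // called by the business code whenever a transaction has finished
  public void reportTransaction(boolean ok) {
    results.add(ok);
  }

  @Override
  public Result execute() {
    int ok = 0;
    int failed = 0;
    Boolean entry;
    // consume and clear the collected results
    while ((entry = results.poll()) != null) {
      if (entry) { ok++; } else { failed++; }
    }
    int total = ok + failed;
    // tolerate an average failure rate of up to 2%
    if (total == 0 || failed * 100 <= total * 2) {
      return new Result(Result.Status.OK, failed + " of " + total + " transactions failed");
    }
    return new Result(Result.Status.CRITICAL, failed + " of " + total + " transactions failed");
  }
}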

When you design such consolidation algorithms, you should always keep in mind how errors are reported. In the case mentioned above, 10 seconds full of errors would be reported as CRITICAL for only a single reporting cycle. The cycles before and after could be OK again. Or if you have larger cycles (e.g. 5 minutes for your Nagios), think about how 10 seconds of errors are reported while in the remaining 4’50’’ you don’t have any problem at all. Should it be reported with the same result as when the same number of errors is spread over these 5 minutes? How should this case be handled if you have decided to ignore an average rate of 2% failing transactions?

You see that you can spend a lot of thinking on it. But be assured: Do not try to be too sophisticated. Just take a simple approach and implement it. Then you’re better than 80% of all projects: Because you have actually reasoned about this problem and decided to write a health check!

About JCR queries

In the past days 2 interesting blog posts have been written about the use of JCR queries, Dan Klco’s “9 JCR-SQL2 queries every AEM developer should know” and “CQ Queries demystified” by @ItGumby.

Well, if you have already read my older articles about JCR query (part 1 and part 2), you might have got the impression that I am not a big fan of JCR queries. There might be situations where that’s totally true.

When you come from a SQL world, queries are the only way to retrieve data; therefore many developers tend to use queries without ever thinking about the other way offered by JCR: the “walk the tree” approach.

@ItGumby gives 2 reasons why one should use JCR queries: efficiency and flexibility in structure. First, efficiency depends on many factors. In my second post I try to explain which kinds of queries are fast and which ones aren’t, simply because of the way the underlying index works (even with AEM 6.0 it’s still Lucene in 99.9% of the cases). With the custom indexes in AEM 6 we might have a game changer here.
Regarding flexibility: Yes, that’s a good reason. But there are cases where you have a specific structure and are looking for hits only in a small area of the tree; there, walking the tree is often the better choice. But if you need to search the complete tree, a query can be faster.
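To illustrate the “walk the tree” approach with the Sling Resource API, a small sketch (the method name is made up, and in real code you would of course not let this loose on a huge subtree):

// recursively collect all resources below "root" which have the given resource type;
// for a small, well-known subtree this is often cheaper than a JCR query
private void collectByResourceType(Resource root, String resourceType, List<Resource> hits) {
  if (root.isResourceType(resourceType)) {
    hits.add(root);
  }
  for (Resource child : root.getChildren()) {
    collectByResourceType(child, resourceType, hits);
  }
}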

Dan gives a number of good examples for JCR queries. And I wholeheartedly admit that the number of JCR SQL examples on the net is way too low. The JCR specification is quite readable for a large part, but I was never really good at implementing code when I only had the formal description of the syntax of the language. So a big applause to Dan!
But please allow me the recommendation to test every query first on production content (not necessarily on your production system!), just to find out the timing and the number of results. I have already experienced cases where an implementation was fast on a development system but painfully slow on production, just because of this tiny aspect.

Managing multiple instances of services: OSGI service factories

And now the third post in my little series of OSGI related postings. I already showed how you can easily manage a number of services implementing the same interface using a service tracker or by using the right SCR annotations.

Sometimes you need to implement services which just differ by configuration. A nice example for this is logging, where you want the possibility to have multiple loggers writing to different log files at different levels. Somebody (normally the admin of the system) is then able to leverage this and configure the logging as she likes.

So, more formally, you have zero to N instances of the same service, just with different configurations. Just duplicating the code and creating a logger1 with configurable values, a logger2, a logger3 and so on doesn’t make sense, as it’s just code duplication and inflexible (what happens when you need logger100?).

For this kind of problem OSGI offers the concept of service factories. As the name already says, it’s a factory which creates services, just by configuration.

As an example let’s assume that you need to send out emails via a configurable number of SMTP servers, because for internal emails you need to use a different mailserver than for external users or partners. We will implement this as a service, and on the service we will configure the details of the SMTP server we intend to use.

So let’s start with the service interface:

public interface MailService {
  public void sendMail (String from, String to, String body);
}

And a dummy implementation could look like this; we want to focus only on the properties, and not really on the details of how to send emails 🙂

@Service
@Component(metatype=true,label="simple mailservice", description="simple mailservice")
public class MailServiceImpl implements MailService {

  private static final String DEFAULT_ADDRESS="localhost:25";
  @Property (description="adress of the SMTP server including port",value=DEFAULT_ADDRESS)
  private static final String ADDRESS = “mailservice.address”;
  private String address;
  
  private static final String DEFAULT_USERNAME="admin";
  @Property(description=“username to login to the SMTP server”,value=DEFAULT_USERNAME)
  private static final String USERNAME = “mailservice.username”;
  private String username;

  private static final String DEFAULT_PASSWORD="admin;
  @Property(description=“password to login to the SMTP server”,value=DEFAULT_PASSWORD)
  private static final String password = “mail service.password”;
  private String password;

  @Activate
  protected void activate (ComponentContext ctx) {
    address = PropertiesUtil.toString(ctx.getProperties().get(ADDRESS),DEFAULT_ADDRESS);
    username = PropertiesUtil.toString(ctx.getProperties().get(USERNAME),DEFAULT_USERNAME);
    password = PropertiesUtil.toString(ctx.getProperties().get(PASSWORD),DEFAULT_PASSWORD);
  }

  public void sendMail(String from, String to, String body) {
    // login to the smtp server using address, username and password provided via OSGI 
    // properties and send the email
  }
}

But how can you extend this and make sure that you can create a configuration not just for 1 mailserver, but for multiple ones? Easy, just add the property “configurationFactory=true” to the @Component annotation.

@Component(metatype=true,label="simple mailservice", description="simple mailservice",
  configurationFactory=true)

If you compile and deploy your service now, you can see in your Apache Felix Configuration Manager that you have a “plus” sign in front of the service; and when you click it, you get a new instance of your service, which you can configure.

ConfigurationFactory in the Felix Console

But when you have 3 mail services configured, which one do you get when you have something like this:

@Reference
MailService mailService;

The answer: It’s not deterministic. You might get any of the configured mailservices. If you need a specific one, the easiest way is to provide labels for them to make them unique. So add another property to your service:

@Property(description="Label for this SMTP service")
private static final String NAME = "mailservice.label";

And when you then create a configuration with the label “INTERNET”, you can reference exactly this service instance with this kind of reference:

@Reference(target="(mailservice.label=INTERNET)")
MailService mailService;

This will resolve correctly when you have a MailService configured with the label “INTERNET”. As long as you don’t have such a service, any service containing such a reference won’t start (unless you create a dynamic reference …)

If you want to be more flexible and also implement some more logic in the lookup process (e.g. having a default mailservice or supporting a number of INTERNET mail services), you can use the whiteboard pattern to track all available MailService instances; based on their labels you can implement any logic you need.
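A rough sketch of such a lookup (the class name and the bind/unbind logic are just illustrative; the label property is the one defined above):

@Service(value = MailServiceRouter.class)
@Component
@Reference(name = "mailServices", referenceInterface = MailService.class, policy = ReferencePolicy.DYNAMIC, cardinality = ReferenceCardinality.OPTIONAL_MULTIPLE)
public class MailServiceRouter {

  // map of label -> MailService, updated whenever service instances come and go
  private final Map<String, MailService> servicesByLabel = new ConcurrentHashMap<String, MailService>();

  protected void bindMailServices(MailService service, Map<String, Object> properties) {
    Object label = properties.get("mailservice.label");
    if (label != null) {
      servicesByLabel.put(label.toString(), service);
    }
  }

  protected void unbindMailServices(MailService service, Map<String, Object> properties) {
    Object label = properties.get("mailservice.label");
    if (label != null) {
      servicesByLabel.remove(label.toString());
    }
  }

  public MailService getByLabel(String label) {
    return servicesByLabel.get(label);
  }
}

Based on this map you can then implement a default mailservice or any other fallback logic you need.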

As you see, OSGI is quite powerful when it comes to looking up and connecting services. Combined with the power of SCR you can easily create a lot of configurable services with very little effort. Managing these services and doing a proper lookup is also just a few lines of code away.

Personally I really like the possibilities I have with the OSGI container inside of AEM, it gives me the flexibility to access lots of different parts of the system. And creating and injecting my own services is easier than ever.

The magic of OSGI: track the come and go of services

Have you ever asked yourself how it comes that you just need to implement an interface, mark the implementation as a service, and — oh magic — your code is called at the right spot? Without any registration.
For example, when you wrote your first Sling servlet (oh sure, you did it!) it looked like this:

@Service
@Component
@Property (name="sling.servlet.paths", value="/bin/myservlet")
public class MyServlet extends SlingSafeMethodsServlet {
  ...
}

and that’s it. How does Sling (or OSGI or whoever) know that there is a new servlet in the container and call it when you visit $Host/bin/myservlet? Or how can you build such a mechanism yourself?

Basically you just use the power of OSGI, which is capable of notifying you about any change in the status of a bundle or service.

If you want to keep track of all Sling servlets registered in the system you just need to write some more annotations:

@Service(value=ServletTracker.class)
@Component
@Reference (name="servlets", referenceInterface=Servlet.class, policy=ReferencePolicy.DYNAMIC, cardinality=ReferenceCardinality.OPTIONAL_MULTIPLE)
public class ServletTracker {

  // use a thread-safe list, as servlets can be bound and unbound at any time
  private final List<Servlet> allServlets = new CopyOnWriteArrayList<Servlet>();

  protected void bindServlets (Servlet servlet, final Map<String, Object> properties) {
    allServlets.add (servlet);
  }

  protected void unbindServlets(Servlet servlet, final Map<String,Object> properties) {
    allServlets.remove(servlet);
  }

  public void doSomethingUseful() {
    for (Servlet servlet : allServlets) {
      // do something useful with them ...
    }
  }
}

(Of course you can track any other interface through which services are offered. But be aware that in many cases only a single instance of a service exists.)

The magic is mostly in the @Reference annotation, which defines that there is an optional reference which takes zero to unlimited references to services implementing the class/interface “Servlet”. By default methods are called whose names are constructed from the “name” attribute, resulting in “bindServlets” being called when a new servlet is registered, and “unbindServlets” when the servlet is unregistered. You can use these methods for whatever you want to do, for example storing the references locally and calling them whenever appropriate. And that’s it.

If you use this approach, your code is called whenever some service implementing a certain interface is activated/deactivated. With the SCR annotations it’s all possible without too much trouble, and best of all: nearly everything just by declaration.
If you like to have some more control (or just want to code it yourself) you can use a ServiceTracker (a nice small example for it is the Apache Aries JMX Whiteboard) to keep track of services manually.

And as recommended reading right now: Learn all the other cool stuff at the Felix SCR page.

And if you need to have more code which uses this approach, you might want to have a look at the SlingPostServlet, which is an excellent example of using this pattern. Oh, and by the way: This pattern is called the OSGI whiteboard pattern.

AEM 6.0 and Apache Oak: What has changed?

One of the key features of AEM 6.0 on the technical side is the use of Apache Oak as a much more scalable repository. It supports the full semantics of JCR 2.0, so all CQ 5.x applications should continue to work. And as an extension of this feature, there is of course MongoDB, which you can use together with Oak.

But, as with every major reimplementation, something has changed. Things which worked well on Jackrabbit 2.x and CRX 2.x might behave differently. Or to put it in other words: Jackrabbit 2.x allowed you to do some things which are not mandated by the JCR 2.0 specification.

One of the most prominent examples of this is the visibility of changed nodes. In CRX 2.x, when you have an open JCR session A, and in a different session B some nodes are changed, you will see these changes immediately in session A. That’s not mandated by the specification, but Jackrabbit supports it.

Oak introduced the concept of MVCC (multi-version concurrency control), which means that each session only sees the view of the repository that was the most recent one when the session was created; it is not updated on-the-fly with the changes performed by other sessions. So this is a static view. If you want to get the most recent view of the repository, you need to explicitly call “session.refresh()”.
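A minimal sketch of this behaviour (repository and credentials are assumed to be available, and sessionA is assumed to be used by a different thread than sessionB; see the note on single-thread change propagation below):

// sessionA was opened earlier and works on its own snapshot of the repository
Session sessionA = repository.login(credentials, null);

// another thread changes content with its own session
Session sessionB = repository.login(credentials, null);
sessionB.getRootNode().addNode("newNode");
sessionB.save();

sessionA.nodeExists("/newNode");   // still false, sessionA sees its old snapshot
sessionA.refresh(true);            // true = keep the pending changes of sessionA
sessionA.nodeExists("/newNode");   // now true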

So, what’s the effect of this?
You may run into subtle inconsistencies, because you don’t see changes performed by others in your session. In most cases only long-running sessions are really affected by this, because they are often intended to react to changes from the outside or to changes made by other threads (e.g. to check if a certain node has already been created by another session). So if you have already followed the best practices established in the last 1-2 years, you should be fine, as long-running sessions have been discouraged. I also already showed how such a long-running session might affect performance when used in a service context.

Oak supports you with some more “features” to spot such problems more easily. First, it prints a warning to the log when a session is open for more than 1 minute. You can check the log and review the use of these sessions. A session being open for more than 1 minute is normally a clear sign that something’s wrong and that you should think about creating sessions with a smaller lifespan. On the other hand, you can also imagine cases where a session open for a longer time is the right solution. So you need to carefully evaluate each warning.
And as a second “feature”, Oak is able to propagate changes between sessions, if these changes are performed by a single thread (and only by a single thread).
But consider these features (especially the change propagation) as transient features, which won’t be supported forever.

This is one of the interesting changes in Apache Oak compared to Jackrabbit 2.x; you can find some more in the Jackrabbit/Oak wiki. It’s really worth a look when you start with your first AEM 6.0 project.

AEM 6.0: Admin sessions

With AEM 6.0 comes a small feature which should make you reconsider your usage of sessions, especially the use of admin sessions in your OSGI services.

The feature is: “ResourceResolverFactory.getAdministrativeResourceResolver” is going to be deprecated!

Oh wait, that is supposed to be a feature, you might ask. Yes, it is. Because it is being replaced by a mechanism which allows you to easily replace the sessions previously owned by the admin (for the sake of developer laziness …) with sessions owned by regular users; users which don’t have the super-powers of admin, but have to follow the ACLs like any other regular user.

A nice description how it works can be found on the Apache Sling website.

But how do you use it?

First, define which user should be used by your service. Specify this in the form “symbolic-bundle-name:subservice-name=user” in the config of the ServiceUserMapper service.
Then there are 2 extensions to existing services which leverage this setting:

ResourceResolverFactory.getServiceResourceResolver(authenticationInfo) returns a ResourceResolver created for the user defined in the ServiceUserMapper for the containing bundle (you can specify the sub service in the authenticationInfo if required).

And the SlingRepository service has an additional method loginService(subserviceName, workspace), which returns you a session using this user.
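A minimal sketch of the ResourceResolver variant (the subservice name "my-subservice" and the surrounding method are just made up for illustration):

@Reference
private ResourceResolverFactory resolverFactory;

private void doSomethingAsServiceUser() throws LoginException {
  Map<String, Object> authenticationInfo = new HashMap<String, Object>();
  authenticationInfo.put(ResourceResolverFactory.SUBSERVICE, "my-subservice");
  ResourceResolver resolver = null;
  try {
    resolver = resolverFactory.getServiceResourceResolver(authenticationInfo);
    // work with the resolver; it is restricted to the ACLs of the mapped service user
  } finally {
    if (resolver != null) {
      resolver.close();
    }
  }
}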

But then this leaves you with the really big task: What permissions should my service user have then? read/create/modify/delete on the root node? But that’s something you can delegate to the people who are doing the user management …

Update 1: Sune asked if you need to specify a password. Of course not 🙂 Such a requirement would render the complete approach redundant.

Meta: AEM 6.0

I am a bit behind the official announcement of AEM 6.0 (Adobe TV, docs), and some of my colleagues have taken the lead and already started posting about the major new technical features, Apache Oak (including MongoDB) and Sightly. My colleague Jayna Kandathil offers a nice overview of the technical news.

I will focus on the smaller changes in the stack, and there’s a vast number of them. So stay tuned, I hope to find some quiet moments to blog in the next 2 weeks.