OSGi: static and dynamic references

OSGi as a component model is one of the cores of AEM, as it allows parts of the system to dynamically register services and to consume services offered by other parts. It’s the central registry you can ask for all kinds of services.

If you have a few weeks of CQ experience as a developer, you probably already know the mechanics of how to access a product service. For example, one of the most frequently used statements in a service (or component) is:

@Reference
 SlingRepository repo;

which gives you access to the SlingRepository service, through which you can reach the JCR. Technically speaking, you build a static reference, so your service only becomes active when this reference can be resolved. This lets you rely on the repository being available whenever your service is running. In many cases this constraint is not a problem, because it wouldn’t make sense for your service to run without the repository anyway, and it also frees you from permanently checking whether “repo” is null 🙂

Sometimes you don’t want to wait for a reference to be resolved (maybe to break a dependency loop), or you can deliver additional value only if a certain (optional) service is available. In such cases you can make it a dynamic reference:

@Reference(policy=ReferencePolicy.DYNAMIC)
SlingRepository repo;

Now there’s no hard dependency on the SlingRepository service; your service might become active before the SlingRepository service is available, and therefore you need to handle the case that “repo” is null.
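As an illustration, here is a minimal sketch using the Felix SCR annotations; the OPTIONAL_UNARY cardinality and the field handling are my additions, not part of the example above, but this is typically how a possibly-missing service is consumed safely:

@Reference(policy = ReferencePolicy.DYNAMIC, cardinality = ReferenceCardinality.OPTIONAL_UNARY)
private volatile SlingRepository repo;

public void doSomething() {
  SlingRepository localRepo = repo; // copy the field, it may be unbound at any time
  if (localRepo != null) {
    // the repository is available, use it
  } else {
    // degrade gracefully, the repository is not (yet) bound
  }
}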

Per se this feature might have little importance to you, but combining it with other aspects makes it really powerful. More on that in the next post…

Why is it hard to do disk size estimations upfront?

Whenever a new project starts, a project manager is responsible for sizing the project, so that the right number of people with the right skills are assigned to it. In many projects another early task is to size the hardware, mostly because the time to buy and deploy new hardware can be pretty long. On one project I worked on, it took IT 10 months (!!) from the decision to buy 8 of “these” boxes until we were able to log in on them. And by the way, this was the regular hardware purchasing process with no special cases …

Anyway, even if the whole “new hardware purchase and deployment” process takes only 6 weeks, you cannot just start development and determine afterwards what hardware is needed. When development starts, a development infrastructure must be provided, consisting of a reasonable number of systems with enough available resources. So one of the earliest project tasks is an initial system sizing.

If you have done this a few times for a specific type of project (for example CQ5 projects), you can give a basic sizing without doing major calculations; at that point you usually don’t have enough information to do a real calculation at all. So for a centralized development system (that’s where the continuous integration server deploys to) my usual recommendation is “4 CPUs, 8-12G RAM, 100G free disk; this for 1 author and 1 publish”. This is a reasonable system to actually run development on. (Remember that each developer also has an authoring and a publishing instance deployed on her laptop, where she actually tries out her code. On the central development systems all development tests are executed, as well as some integration tests.)

This gets much harder if we talk about higher environments like staging/pre-production/integration/test (or whatever you might call them) and, of course, production, because there we have much more content. This content is the most variable factor in all calculations: in most requirement documents it is not clear how much content will be in the system in the end, how many assets will be uploaded, how often they will change, and when they will expire and can be removed. To be honest, I would not trust any number given in such a document, because it usually changes over time, even during the very first phase of the project. So you need to be flexible regarding content, and therefore also regarding the disk space calculation.

My colleague Jayan Kandathil posted a calculation for the storage consumption of images in the datastore. It’s an interesting formula, which might be accurate (I haven’t validated the values), but I usually do not rely on such formulas because:

  • We do not only upload images to the DAM, and besides the datastore we also have the TarPM and the Lucene index, which contribute to the overall repository growth.
  • I don’t know whether there will be adjustments to the “Asset update” workflow, especially whether more/fewer/changed renditions will be created. With CQ 5.5, any change of asset metadata also affects the datastore (XMP writeback changes the asset binary, which results in a new file in the datastore!).
  • I don’t know if I can rely on the numbers given in the requirements document.
  • There is a lot of other content in the repository which I usually cannot estimate upfront. So the datastore consumption of the images is only a small fraction of the overall disk space consumption of the repository.

So instead of calculating the disk size based only on assumptions, I usually name a disk size to start with. This number is high enough that they won’t fill it up within the first 2-3 months, but not so large that they will never reach 100% of it; it’s somewhere in between. So they can go live with it, and they need to monitor it. Whenever they reach 85% of the disk size, IT has to add more disk space. If you run this process for some time, you can do a pretty good forecast of the repository growth and react accordingly by attaching more disk space. I cannot do this forecast upfront, because I don’t have any reliable numbers.

So, my learning from this: I don’t spend much time on disk calculations upfront. I only give the customer a model, and based on this model they can react and attach storage in a timely manner. This is also the cheapest approach, because you attach storage only when it’s really needed and not based on some unreliable calculation.

CQ coding patterns: Sling vs JCR (part 3)

In the last 2 postings (part 1, part 2) of the “Sling vs JCR” shootout I only handled read cases, but as Ryan D Lunka pointed out in the comments on part 1, writing to the repository is the harder part. I agree. While on the read side you can often streamline the process of reading nodes and properties and ignore some bits and pieces (e.g. mixins, node types), on the write side this information is very important: there you define the layout and the types, so you need full control over them. The JCR API is very good at giving you this control, so there isn’t much room for simplification in a Sling solution.

Indeed, in the past there was no “Resource.save()” method or anything like it. Essentially the ResourceResolver (as a rough equivalent of the JCR session) was only designed for retrieving and reading information, not for writing. There are some front ends like the Sling POST servlet, which access the JCR repository directly to persist changes. But these are only workarounds for the fact that there is no real write support at the Sling resource level.

In recent months there has been some activity in that area. There is a wiki page with a design discussion about supporting CRUD operations within Sling, and code has already been submitted. According to Carsten Ziegeler, the submitter of this Sling issue, the feature should be available in the upcoming version of CQ. So let’s see what it will really look like.

But until then you have 2 options to get your data into the CRX repository:

  • Use the SlingPostServlet: especially if you want to create or change data in the repository from remote, this servlet is your friend. It is flexible enough to cover most normal use cases. For example, most (if not all) of the CQ dialogs and actions use this SlingPostServlet to persist their data in the repository.
  • If that’s not sufficient for you, or you need to write structures directly from code deployed within CQ, use the JCR API (see the sketch below).
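For completeness, a minimal sketch of such a direct JCR write; the path and names are made up for illustration, and the session is assumed to be obtained elsewhere (e.g. via the SlingRepository):

// assumes an existing javax.jcr.Session, e.g. obtained via a SlingRepository
Node parent = session.getNode("/content/myapp");          // hypothetical parent path
Node child = parent.addNode("mynode", "nt:unstructured"); // you control the node type
child.addMixin("mix:versionable");                        // ... and the mixins
child.setProperty("title", "Hello World");
session.save();                                           // persist the changes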

So for now Sling does not bring any benefit if your only job is to save data from within OSGi code into CQ. In many cases the SlingPostServlet might be sufficient when you need to update data in CRX from external. And let’s see what CQ 5.6 brings with it.

CQ coding patterns: Sling vs JCR (part 2)

In the last posting I showed the benefits of Sling over plain JCR regarding resource access. But resource access is not the only area where both frameworks offer similar functionality; another important one is listening to changes in the repository. So today I want to compare JCR observation with Sling eventing.

JCR observation is part of the JCR specification and is a very easy way to listen for changes in the repository.

@Component(immediate = true, metatype = false)
@Service
public class Listener implements EventListener {

  @Reference
  SlingRepository repo;

  Session session;
  Logger log = LoggerFactory.getLogger(Listener.class);

  @Activate
  protected void activate() {
    try {
      Session adminSession = repo.loginAdministrative(null);
      session = adminSession.impersonate(new SimpleCredentials("author", new char[0]));
      adminSession.logout();
      adminSession = null;
      session.getWorkspace().getObservationManager().addEventListener(this, // listener
        Event.NODE_ADDED | Event.NODE_REMOVED | Event.NODE_MOVED, // eventTypes
        "/", // absPath
        true, // isDeep
        null, // uuid
        null, // nodeTypeNames
        true // noLocal
      );
    } catch (RepositoryException e) {
      log.error("Error while registering observation", e);
    }
  }

  @Deactivate
  protected void deactivate() {
    try {
      session.getWorkspace().getObservationManager().removeEventListener(this);
    } catch (RepositoryException e) {
      log.error("Error while unregistering observation", e);
    }
    session.logout();
    session = null;
  }

  public void onEvent(EventIterator events) { // required by javax.jcr.observation.EventListener
    while (events.hasNext()) {
      Event e = events.nextEvent();
      … // do your event handling here
    }
  }
}

In JCR the creation of an observation listener is straightforward, and so is the event handling. The observation process is tied to the lifetime of the session, which is opened and closed at activation/deactivation of this OSGi service. This kind of implementation is a common pattern.
Note: Be aware of the implications of using an admin session!

  • You can read and write everywhere in the repository
  • You work with elevated rights you probably never need.

So try to avoid an admin session. In the example above I therefore use a session owned by the user “author”, obtained via impersonation.

A Sling implementation of the same could look like this:

@Component(immediate = true)
@Service
@Property(name = "event.topics", value = "org/apache/sling/api/resource/Resource/*")
public class Listener implements EventHandler {

  public void handleEvent(Event event) {
    // handle
  }
}

You can see that we don’t have to care about registering a listener or managing a session. Instead we just subscribe to certain events emitted by the Sling eventing framework. This framework essentially implements the OSGi event specification, and therefore you can subscribe in the very same way to various other event topics. You can check the “Event” tab in the Felix console, where you can see a list of events that recently happened in your CQ instance.

Some examples of topics:

  • org/apache/sling/api/resource/Resource/ADDED, …/CHANGED and …/REMOVED for repository resource changes (the topics used in the example above)
  • org/osgi/framework/BundleEvent/STARTED (and related topics) for OSGi framework events bridged into the event admin

One disadvantage of Sling eventing is that you cannot restrict the resource events to a certain path or user. Instead, Sling registers with an admin session and publishes all observation events starting at the root node of the repository (/). So you should filter quite early in your handleEvent routine for the events you are really interested in, as sketched below.
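A minimal sketch of such early filtering (the path “/content/myapp” is just an assumed example):

public void handleEvent(Event event) {
  String path = (String) event.getProperty(SlingConstants.PROPERTY_PATH);
  if (path == null || !path.startsWith("/content/myapp")) {
    return; // not our subtree, discard the event as early as possible
  }
  // real event handling here
}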

But: with these events you don’t get hold of a resource. You always have to resolve it yourself, following this pattern:

String path = (String) event.getProperty(SlingConstants.PROPERTY_PATH);
if (path != null) {
  Resource resource = resourceResolver.getResource(path);
}

Also here you need to take care that you’re using an appropriate ResourceResolver (try to avoid an admin resource resolver as well).

JCR observation, on the other hand, has its benefits if you want or need to limit the events you receive more granularly, for example by listening only to changes of specific node types or mixins.

So, as a conclusion: if you need to restrict the amount of repository events you want to process directly at the API level, JCR observation is the right thing for you. Otherwise go for Sling eventing, as it also offers you the possibility to receive many more types of events. For non-repository events it’s the only way to listen to them.

User administration on multi-client-installations

Developing an application for a multi-client installation isn’t only a technical or engineering task; it also raises questions which affect administration and organisational processes.

To ease administration, the user accounts in CQ are often organized in a hierarchy, so that users placed higher in the hierarchy can administer users placed lower in the hierarchy tree below them. Using this mechanism an administrator can easily delegate the administration of certain users to other users, who can then do administrative work for “their” users.

The problem arises when a user needs rights in 2 applications within the same CQ instance and every application should have its own “application administrator” (a child node of the superuser). Then this kind of administration is no longer possible, because it is impossible to model a hierarchy in which application administrator A has neither a parent nor a child relation to application administrator B, while both A and B are still placed higher in the hierarchy than any shared user C.

I assume that creating separate accounts for the same person in different applications isn’t feasible. That would be the easiest solution from an engineering point of view, but it contradicts the ongoing move not to create a new user/password pair for each application and each user (single sign-on).

This problem imposes the burden of user administration (e.g. assigning users to groups, resetting passwords) on the superuser, because the superuser is the only user which is always (either directly or transitively) a parent of every other user. (A non-CQ-based solution would be to handle user-related changes like password set/reset and group assignment outside of CQ and to synchronize these data into CQ, e.g. by using a directory system based on LDAP.)

ACLs, access to templates and workflows should be assigned only via groups and roles, because these can be created per application. So if an application is currently based on a user hierarchy and individual user rights, it’s hard to add a new application using the same users.

So one must make sure that all assignments are based only on groups and roles which are created per application. Assigning individual rights to a single user isn’t the way to go.

Being a good citizen in multi-client-installations

Working in a group requires a kind of discipline which some people are not used to. I remember a colleague who always complained about his office mate, who used to shout on the phone even in normal conversations. If people work together and share resources, everyone is expected to be cooperative and not to trash the work and morale of their team members.

The same applies to applications: if they share resources (because they may run on the same machine), they should release resources they no longer need, and should only claim resources they actually need. Consuming all CPU because “the developer was too lazy to develop a decent algorithm and just chose the brute-force solution” isn’t considered good behaviour. It gets even harder if these applications are contained within one process, so that a process crash not only affects application A, but also B and C. And then it doesn’t matter that those are well designed and perfectly developed.

So if you plan to deploy more than one application to a single CQ instance, you should make sure that the developers are aware of this condition and keep it in mind. The application no longer controls the heap usage on its own (on top of the heap consumption of CQ itself), but must share it with other applications. It must be programmed with stability and robustness in mind, because unknown structures and services may change assumptions about timing and sizes. And yes, a restart also affects the others. So restarting because of application problems isn’t a good thing.

In general an application should never require a restart of the whole JVM; only necessary changes to JVM parameters should justify one. All other application-specific settings should be changeable through special configuration templates which are evaluated at runtime, so changes are picked up immediately. This even reduces the amount of work for the system administrator, because changing such values can be delegated to special users using the ACL mechanism.

CQ and multi-client capability

In large companies there’s a need for a WCMS like Day CQ in many areas; of course for the company website, but also for divisions offering services to internal or external customers, which either implement their application on top of Day CQ, use CQ as a proxy for other backend systems, or just need a system to provide the online help.

Some of these systems are rather small, but nevertheless the business needs a system where authors can create and modify content according to the business’ needs. If you already have CQ and knowledge in creating and operating applications on it, it’s quite natural to use it to satisfy these needs. The problem is that building up and operating a CQ environment isn’t cheap for only one application. So very often one asks whether CQ is capable of hosting several applications within one CQ installation to leverage economies of scale. CQ is able to handle hundreds of thousands of content handles per instance, so it should be able to host 5 applications with 20 thousand handles each, shouldn’t it?

Well, that’s a good question. By design CQ is able to host multiple applications, enforce content separation through the ACL concept, and limit the access to templates for users. Sounds good. But, as always, there are problematic details.

I will cover these problems and ways to resolve them in the next posts.

Is CRX 1.4.2 production ready?

The contentbus technology was the standard storage backend up to the CQ 4.x series. Although file-based storage wasn’t really state of the art even in the late 1990s (MySQL had already been released, PostgreSQL existed, plus at least half a dozen enterprise database systems), Day chose to store the content objects in individual files, hidden behind an abstraction layer. Of course it took some time of tuning and gathering experience, but the contentbus proved to be a reliable storage which had the big advantage that you could solve nearly all problems with an editor on the filesystem (we used sed more than once to fix our default.map).

But some points were still open:

  • Online backup isn’t possible. The documentation simply states: “Shut down the CQ, copy your files, and start it up again”. You can speed up the copy by replacing it with a snapshot at the filesystem layer, but the need to restart doesn’t make it enterprise-ready. (Databases have offered online backup for at least a decade.)
  • The contentbus requires a lot of filesystem I/O (mainly the system calls open and close). Having a lot of these operations slows down processing. A small number of larger files would reduce this administrative overhead in the filesystem.
  • Memory usage of contentbus artifacts: some artifacts like the default.map and zombie.map have in-memory data structures which grow as the underlying files grow (and vice versa). The more content you have, the more memory is used, even if only a small part of this content is in active use. This doesn’t scale well.
  • The contentbus offers cluster support, but only with 2 nodes; with more nodes the overall performance will even degrade! According to the cluster documentation for CRX 1.4, Day tested CRX in a clustered setup with 6 nodes. If the performance loss is acceptable (that means, 6 nodes offer more performance than 5 nodes), this would be a really good solution to scale your authoring systems.

So we decided that it’s time to evaluate whether CRX would be at least as good as the contentbus. The TAR persistence manager mainly addresses the backup issue; we hope to get some performance improvements as well.

So currently I’m doing a test setup of CQ 4.2.0 and CRX 1.4.2, for which Day offered (just in time :-)) a knowledge base article.

Take care of your selectors!

Recently I showed two scenarios where selectors can be used as a way to cache several different views of a single page. This quite often allows one to avoid HTTP parameters, reducing the load on your machines and speeding up your website.

Let’s assume that you have the already mentioned handle /etc/medialibrary/trafficjam.html and that your templates support displaying the image in 3 different sizes, “preview”, “big” and “original”. So what happens if somebody chooses to request the URL “/etc/medialibrary/trafficjam.tiny.html”?

I checked some CQ-based websites and tested this behaviour by just adding a dummy selector. In most cases a proper page is rendered, looking the same as without the selector. So most templates (and also template developers) ignore selectors that the specific template isn’t expected to handle. That’s good, isn’t it?

Well, in combination with the dispatcher cache it isn’t good, because the dispatcher caches everything which CQ returns with an HTTP status code of 200. So just adding a “foo” selector will place another copy of the page in the dispatcher cache. The same happens with a “foo1” selector, and so on. In the end the disk is full, the dispatcher cannot write any more files to disk, and it will forward every request to your CQ.

So, how can you avoid this problem? As said, the dispatcher caches only when it receives an HTTP status code 200. So you need to add some code to your templates which always checks the selectors. If a template that doesn’t support any selectors is called with one, don’t return a status code 200, but a 302 redirect to the same page without any selectors, or just a plain 404 (“not found”); calling this page with selectors isn’t a valid action and should never happen, so such a status code is ok. The same applies when the template supports a limited set of selectors (“preview”, “big” and “original” in the example above): treat them as a whitelist, and if the given selector doesn’t match, return a 302 or 404.
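As an illustration, a minimal sketch of such a whitelist check in component Java code; slingRequest and slingResponse are assumed to be available (as they are in a JSP or a Sling servlet), and exception handling is omitted:

// selectors this template supports; everything else gets rejected
List<String> allowed = Arrays.asList("preview", "big", "original");

for (String selector : slingRequest.getRequestPathInfo().getSelectors()) {
  if (!allowed.contains(selector)) {
    // 404 responses are not cached by the dispatcher
    slingResponse.sendError(HttpServletResponse.SC_NOT_FOUND);
    return;
  }
}
// regular rendering continues here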

This way you don’t pollute your cache and still have the flexibility to use selectors. I think this will outweigh the cost of adjusting your templates.

Permission sensitive caching

In recent versions of the dispatcher (starting with the 4.0.1 release) Day added a very interesting feature which allows one to cache content on dispatcher level even if it is not public.
Honwai Wong of the Day support team explained it very well at the TechSummit 2008. I was a bit surprised, but I even found it on Slideshare (the first half of the presentation).
Honwai explains the benefits quite well. From my experience you can reduce the load on your CQ publishers considerably (trading a request which requires the rendering of a whole page for a request which just checks the ACLs of a page).

If you want to use this feature, you have to make sure that for every group or user who has to have an individual page, the dispatcher delivers the right one. Imagine you want to present the latest company news to logged-in users, but users who are not logged in shouldn’t get them, and only the managers get the link to the latest financial data on the start page. So you need a start page for 3 different groups (not-logged-in users, logged-in users, managers), and the system should deliver the appropriate one. Having a single home.html isn’t enough; you need to distinguish.

The easiest way (and the Day way ;-)) is to use a selector denoting the group the user belongs to, so home.group-logged_in.html or home.managers.html would be good. If no selector is given, we assume the user to be anonymous. You have to configure the link checker to rewrite all links so that they contain the correct selector. If a user belongs to the logged_in group and requests the home.logged_in.html page, the dispatcher asks CQ “the user has the following HTTP header lines and is requesting home.logged_in.html, is that ok?”. CQ then checks whether the given HTTP header lines belong to a user of the group logged_in; because they do, it responds with “200 OK, just go on”, and the dispatcher delivers the cached file, so there’s no need for CQ to render the same page again and again. If the user doesn’t belong to that group, CQ will detect that and send a “403 Permission denied”, which the dispatcher then forwards to the user. If a user is a member of more than one group, having multiple “group” selectors is perfectly valid.
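CQ performs this check through its regular ACL evaluation, so you don’t have to code it yourself. Purely to illustrate the kind of decision being made, here is a hypothetical sketch of such a group membership check in Java (all names are assumptions, not product code; exception handling omitted):

// does the current user belong to the group encoded in the selector?
Session session = slingRequest.getResourceResolver().adaptTo(Session.class);
UserManager userManager = ((JackrabbitSession) session).getUserManager();
Authorizable user = userManager.getAuthorizable(session.getUserID());
Group group = (Group) userManager.getAuthorizable("logged_in"); // assumed group id
if (group == null || !group.isMember(user)) {
  slingResponse.sendError(HttpServletResponse.SC_FORBIDDEN); // 403, the dispatcher will not cache this
  return;
}
// 200 OK, the dispatcher may deliver the cached file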

Please note: I speak of groups, not of (individual) users. I don’t think this feature is useful when each user requires a personalized page. The cache hit ratio would be pretty low (especially if you include often-changing content, e.g. daily news or the content of an RSS feed) and the disk consumption would be huge. If a single page is 20k and you cache a version for 1000 users, you use 20 MB of disk for a single page! And don’t forget the performance impact of a directory filled with thousands of files. If you want to personalize pages for individual users, caching is inappropriate. Of course the usual nasty hacks are applicable, like requesting the user-specific data via an AJAX call and then modifying the page in the browser using JavaScript.

Another note: currently no documentation is available on permission-sensitive caching, only the presentation by Honwai Wong linked above.