Writing backwards compatible software

On last week’s adaptTo() conference I discussed the topic of AEM content migration with a few attendees, and why it’s a tough topic. I learned that the majority of these adhoc-migrations are done, because they are mandated by changes in the components themselves. And therefor migrations are required to adhere to the new expectations of the component. My remark “can’t you write your components in a way, that they are backwards compatible” was not that well received … it seems that it this is a hard topic for many.

And yes, writing backwards compatible components is not easy, because it comes with a few prerequisites:

The awareness, that you are making a change, which breaks compatibility with existing software and content. While the breakages in “software” can be detected easily, the often-times very loose contract between code and content is much harder to enforce. With some experience in that area you will develop a feeling for that, but especially less experienced folks can make such changes inadvertently, and you will detect that problem way too late.
You need to have a strategy which tells how to handle such a situation. While the AEM WCM Core Components introduced a versioning model, which seems to work quite nicely, an existing codebase might not be prepared for this. It forces some more structure and thoughts how to design your codebase, especially when it then comes to Sling Models and OSGI services, and where to have logic, so you don’t duplicate it.
And even if you are prepared for this situation, it’s not for free, you will end up with new versions of components which you need to maintain. Just breaking compatibility is much easier, because you still will have just 1 component.

So I totally get if you don’t care about backwards compatibility at all, because you are in the end the only consumer of your code, and you can control everything. You are not a product developer, where backwards compatibility needs to have a much higher priority.

But backwards compatibility gives you one massive benefit, which I consider as quite important: It gives you the flexibility to perform a migration to a time which is a good fit. It’s not that you need to perform this migration before, in the midst or immediately after a deployment. You deploy the necessary code, and then migrate thecontent when its convenient. And if that migration date is pushed further for whatever reason, it’s not a problem at all, because this backwards compatibility allows you to decouple the technical aspect of it (the deployment) from the actual execution of the content-migration. And for that you don’t need to re-scope deployments and schedules.

So maybe this is just me with the hat of a product developer, who is so focused on backwards compatibility. And in the wild the cost over backwards-compatibility is much higher than the flexibility it allows. I don’t know. Leave me a comment if you want to share your opinion.

How not to do content migrations

(Note: This post is not about getting content from environment A to B or from your AEM 6.5 to AEM CS.)

The requirements towards content and component structure evolve over time; the components which you started initially with might not be sufficient anymore. For that reasons the the components will evolve, they need new properties, or components need to be added/removed/merged, and that must be reflected in the content as well. Something which is possible to do manually, but which will take too much work and is too error-prone. Automation for the rescue.

I already came across a few of those “automated content migrations”, and I have found a few patterns which don’t work. But before I start with them, let me briefly cover the one pattern, which works very well.

The working approach

The only working approach is a workflow, which is invoked on small-ish subtrees of your content. It skips silently over content which does not need to be migrated, and reports every situation which got migrated. It might even have a dry-run mode, which just reports everything it would change. This approach has a few advantages:

It will be invoked intentionally on author only, and only operates a single, well-defined subtree of content. It logs all changes it does.
It does not automatically activate every change it has done, but requires activation as a dedicated second step. This allows to validate the changes and activate it only then.
If it fails, it can repeatedly get invoked on the same content, and continue from were it has left.
It’s a workflow, with the guarantees of a workflow. It cannot time out as a request can do, but will complete eventually. You can either log the migration output or store it as dedicated content/node/binary data somewhere. You know when a subtree is migrated and you can prove that it’s completed.

Of course this is not something you can simply do, but it requires some planning in both designing, coding and the execution of the content migration.

Now, let’s face the few things which don’t work.

Non-working approach 1: Changing content on the fly

I have seen page rendering code, which tries to modify the content it is operating on, removing old properties, adding new properties either with default values and other values.

This approach can work, but only if the user has write permissions on the content. As this migration happens at the first time the rendering is initiated with write permissions (normally by a regular editor on the authoring system), it will fail in every other situation (e.g on publish if the merging conditions exist there as well). And you will have a non-cool mix of page rendering and content-fixup code in your components.

This is a very optimistic approach, over which you don’t have any control, and for that reason you probably can never remove that fixup code, because you never know if all content has already been changed.

Non-working approach 2: Let’s do it on startup

Admitted, I have seen this only once. But it was a weird thing, because a migration OSGI service was created, which executed the content migration in its activate() method. And we came across it because this activate delayed the entire startup to a situation, which caused our automation to run into a timeout, because we don’t expect a startup of an AEM instance to take 30+ minutes.

Which is also its biggest problem and which makes it unusable: You don’t have any control over this process, it can be problematic in the case of clustered repositories (in AEM CS authoring) and even if the migration has already been completed, the check if there’s something to do can take quite long.

But hey, when you have it already implemented as service, it’s quite easy to migrate it to a workflow and then use the above recommended approach.

Let me know if you have found other cases of working or non-working approaches for content migration; but in my experience it’s always the best way to make this an explicit task, which can be planned, managed and properly executed. Everything else can work sometimes, but definitely with a less predictable outcome.

What’s the maximum size of a node in JCR/AEM?

An interesting question which comes up every now and then is: “Is there a limit how large a JCR node can get?”. And as always in IT, the answer is not that simple.

In this post I will answer that question and also outline why this limit is hardly a constraint in AEM development. Also I will show ways how you can design your application so that this limit is not a problem at all.

Continue reading →

Roles & Rights and complexity

An important aspect in every project is the roles and rights setup. Typically this runs side-by-side with all other discussions regarding content architecture and content reuse.

While in many projects the roles and rights setup can be implemented quite straight forward, there are cases where it gets complicated. In my experience this most often happens with companies which have a very strong separation of concerns, where a lot of departments with varying levels of access are supposed to have access to AEM. Where users should be able to modify pages, but not subpages; where they are supposed to change text, but not the images on the pages. Translators should be able to change the text, but not the structure. And many things more.

I am quite sure that you can implement everything with the ACL structure of AEM (well, nearly everything), but in complex cases it often comes with a price.

Performance

ACL evaluation can be costly in terms of performance, if a lot of ACLs needs to be checked; and especially if you have globally active wildcard ACLs. As every repository acecss runs through it, it can affect performance.

There is not hard limit in number of allowed ACLs, but whenever you build a complex ACL setup, you should check and validate its impact to the performance.

Wildcard ACLs can be time consuming, thus make them as specific as possible.
The ACL inheritance is likely to affect deep trees with lots of ACL nodes on higher-level nodes.

Maintainability

But the bigger issue is always the maintenance of these permissions. If the setup is not well documented, debugging a case of misguided permissions can be a daunting task, even if the root cause is as simple as someone being the member of the wrong group. Just imagine how hard it is for someone not familiar with the details of AEM permissions if the she needs to debug the situation. Especially if the original creator of this setup is no longer available to answer questions.

Some experiences I made of the last years:

Hiding the complexity of the site and system is good, but limiting the (read) view of a user only to the 10 pages she is supposed to manage is not required, it makes the setup overly complex without providing real value.
Limiting write access: I agree that not everyone should be able to modify every page. But limiting write access only to small parts of a page is often too much, because then the number of ACLs are going to explode.
Trust your users! But implement an approval process, which every activation needs to go through. And use the versions to restore an older version if something went wrong. Instead of locking down each and every individual piece (and then you still need the approval process …)
Educate and train your users! That’s one of the best investments you can make if you give your users all the training and background to make the best of the platform you provide to them. Then you can also avoid to lock down the environment for untrained users which are supposed to use the system.

Thus my advice to everyone who wants (or needs) to implement a complex permission setup: Is this complexity really required? Because this complexity is rarely hidden, but in the end something the project team will always hand-over it to the business.

Ways to achieve content reuse in AEM

Whenever an AEM project starts, you have a few important decisions to make. I already wrote about content architeture (here and here) and its importance to a succesful project and an efficient content development and maintenance process. A part of this content architecture discussion is the aspect of content reuse.

Content reuse happens on every AEM project, often it plays a central role. And because requirements are so different, there are many ways to achieve content reuse. In this blog post I want to outline some prominent ways you can use to reuse content in AEM. Each one comes with some unique properties, thus pay attention to them.

I identified 2 main concepts of content reuse: Reuse by copy and Reuse by reference.

Reuse by copy

The AEM multisite manager (MSM) is probably the most prominent approach for content reuse in AEM. It’s been part of the product for a long time and therefor a lot of people know it (even when you just have started with AEM, you might came across its idioms). It’s an approach which creates independent copies of the source, and helps you to keep these copies (“livecopies”) in sync with the original version (“blueprint”) . On the other hand side you still can work with the copies as you like, that means modify them, create and delete parts of the pages or even complete pages. With the help of the MSM you can always get back to the original state, or a change on the blueprint can be propagated to all livecopies (including conflict handling). So you can could call this approach “managed copy”.

The MSM is a powerful tool, but comes with its own set of complexity and error cases; you and your users should understand how it works and what situations can arise out of it. It also has performance implications, as copies are created; also rolling out changes on the blueprint to livecopies can be complex and consume quite some server resources. If you don’t need have the requirement to modify the copies, the MSM is the wrong approach for you!

Unlike the MSM the language copy approach just creates simple copies; and when these copies have been created there is no relationship anymore between the source and the target of the language copy. It’s an “un-managed copy”. Personally I don’t see much use in it in a standalone way (if used as part of a translation workflow, the situation is different).

Reuse by reference

Reuse by reference is a different approach. It does not duplicate content, but just adds references and then the reference target is injected or displayed. Thus a reference will always display the same content as the reference target, deviations and modifications are not possible. Referencing larger amount of content (beyond the scope of parts of a single page) can be problematic and hard to manage, especially if these references are not marked explicitly as such.

The main benefit of reuse by reference is that any change to the reference target is picked up immediately and reflected in the references; and that the performance impact is negligible. Also the consistency of the display of the reference with the reference target is guaranteed (when caching effects are ignored).

This approach is often used for page elements, which have to be consistent all over a site, for example for page headers of footers. But also the DAM is used in this way, even if you don’t embed the asset itself into the page, but rather just add a reference to it into the page).

If you implement reuse by reference, you always have to think about dispatcher cache invalidation, as in many cases a change to a reference target it not propagated to all references, thus the dispatcher will not know about it. You often have to take care of that by yourself.

Having said that, what are the approaches in AEM to implement reuse by reference?

Do it on your own: In standard page rendering scripts you already do includes, typically of child nodes inside the page itself. But you can also include nodes from different parts of the repository, no problem. That’s probably the simplest approach and widely used.

Another approach are Content Fragments and Experience Fragments. They are more sophisticated approaches, and also come with proper support in the authoring interface, plus components to embed them. That makes it much easier to use and start with, and it also offers some nice features on top like variants. But from a conceptual point if view it’s still a reference.

A special form of reuse by reference is “reuse by inheritance“. Typically it is implement by components like the iparsys or (when you code your own components) by using the InheritanceValueMap of Sling. In this case the reference target is always the parent (page/node). This approach is helpful when you want to inherit content down the tree (e.g from the homepage of the site to all individual pages); with the iparsys it’s hte content of a parsys, with the InheritanceValueMap it’s properties.

What approach should I choose?

The big differentiator of the reuse by copy and reuse by reference is the question if reused content should be adapted or changed at the location where it should be reused. As soon as you need to have the requirement “I would like to change the content provided to me”, the you need to have reuse by copy. And in AEM this normally mans “MSM”. Because content is not created once, but needs to be maintained and updated. At scale MSM is the best way to do it. But if you don’t have that requirement, use reuse by reference.
You might even use both approaches, “reuse by copy” to manage the reuse of content over different sites, and “reuse by reference” for content within a site.

Your friendly business consultant is a can help you find out which reuse strategy makes sense for your requirements.

Why JCR search is not suited for site search

Many, especially larger websites have an integrated search functionality, which let’s users directly find content of the site without using external search engines like Google or Bing. If properly implemented and used, it can be a tremendous help to get visitors directly to the information they need and want.

I’ve get questions in the past how one can implement such a site search using JCR queries. And at least in the last years my answer always was: don’t use JCR search for that. Let me elaborate on that.

JCR queries are querying the repository, but not the website

With JCR query you are querying the repository, but you don’t query the website. That sounds a bit strange, because the website lives in the repository. This is true, but in reality the website is much more. A rendered page consists of data stored below a cq:Page node. And more data from other parts of the repository. For example you pull in assets to a page, and you also add some metadata from the assets into the rendered page. You add references to other pages and include data from there.

This means, that the rendered page contains much meaningful and relevant information which can and should be leveraged from a search function to deliver the best results. And this data is not part of the cq:page structure in the repository.

Or to put in other words: You do SEO optimization for your page to deliver the most relevant results hoping that its rank on Google gets higher and more relevant for users searching for specific terms. Do you really think that your own integrated site search should deliver less relevant data for the same search?

As a site owner I do not expect that Google delivers for a certain search keyword combination a page A as highest ranked page on my site, but my internal search a different page B which is clearly less relevant for that keywords.

That means, that you should provide your site search the same information and metadata as to Google. And for JCR queries you only have the repository structure and the information stored there, and you should not optimize this as well for relevant search results, but the JCR repository structure aims for different goals (like performance, maintainability, evolvability and others).

JCR queries implement functionality not needed for site search

The JCR query implementation needs to take some concepts into account, which are often not relevant for site search, but which are quite expensive. Just to name a few:

Nodetype inheritance and Mixins: On every search there are checks for nodetypes, sometimes with the need to traverse the hierarchy and check the mixin relationships. That’s overhead.
ACL checks: Every search result needs to be acl checked before returning it, which can be huge overhead, especially if in the most simple case all (relevant) content is public and there should be no need to do such checks at all.
And probably much more.

JCR is not good at features which I would expect from a site search

Performance is not always what I expect from a site search.
With the current Oak implementation you should test every query if it’s covered by indexes; as site search queries are often unpredictable (especially if you do not only allow for a single search term, but also want to include wild cards etc) you’ll always have the risk that something unexpected happens. And then it’s not only about bad performance if your query is not covered by a matching index. It can also be that you deliver the wrong search results (or no search results at all).
Changing index definitions (even adding synonyms or stopwords) are necessarily an admin task, and if improperly done, they impact the overall system. Not to mention the need of reindexing 😦

From my point of view if you cannot solely rely on external search engines (Google, Bing, DuckDuckGo, …) you should implement your site search not on JCR queries. It often causes more trouble than adding a dedicated Solr instance which crawling your site and which is embedded into your site to deliver the search results. You can take this HelpX article as starting point how to integrate Solr into your site. But of course any other search engine is possible as well.

Content architecture: dealing with relations

In the AEM forums I recently came across a question about slow queries. After some back and forth I understood that the poster wanted to do thousands of such queries to render a page. When rendering a product page he wanted to references the assets associated to it.

For me the approach used by the poster was straight forward, based on the assumption that the assets can reside anywhere within the repository. But that’s rarely the case. The JCR repository is not a relational database, where all you have are queries. With JCR you can also iterate through the structure. It’s a question about your content architecture and how you map it to AEM.

That means, that for such requirements like described you can easily design your application in a way, that all assets to a product are stored below the product itself.

Or for each product page there is a matching folder in the DAM where all the assets reside. So instead of a JCR query you just do a lookup of a node at a fixed location (in the first example below the subnode “assets”) or you can compute the path for the assets (/content/dam/products/prodcut_A/assets). That single lookup will always be more performant than a query, plus it’s also easier for an author to spot and work with all assets belonging to a product.

Of course this is a very simplified case. Typically requirements are more complex, also asset reuse is often required. This approach does not work that easy anymore.
And there is no real recipe for it, but ways how to deal with it.

In case of creating such relations between content we often use tags. Content having the same tag are related, and can be added automatically in the list of related content or assets. Using tags as a level of indirection is ok and in the context of the forum post also quite performant (albeit the resolution itself is powered by a single query).

Another approach to come up with modelling the content structure is to look at the workflows the authoring users are supposed to use. Because they also need to understand the relationship between content, which normally leads to something intuitive. Looking at these details might also give you a hint how it can be modeled; maybe just having the referenced assets as paths as part of the product is already enough.

So, as already said in an earlier post, there are many ways to come up with a decent content architecture, but rarely recipies. In most cases it pays of to invest time into it and consider the effects it has on the authoring workflow, performance and other operational aspects.

Creating the content architecture with AEM

In the last post I tried to describe the difference between the information architecture and content architecture; and from an architectural point of the view the content architecture is quite important, because based on that your application design will emerge. But how can you get to a stable and well-thought content structure?

Well, there’s no bullet-proof approach for it. When you design the content architecture for an AEM-based application it’s best to have some experience with the hierarchical approach offered by the repository approach. I will try to outline a process which might help you to get you there.
It’s not a definite guideline and I will never guarantee that it will work for you, as it is just based on my experience with the projects I did. But I hope that it will give some input and can act as a kind of checklist for you. My colleague Alex Klimetschek did a presentation at the adaptTo() conference 2012 about it.

The tree

But before we start, I want to remind you of the fact, that everything you do has to fit into the JCR tree. This tree is typically a big help, because we often think in trees (think of decision trees, divide-and-conquer algorithms, etc), also the URL is organized in a tree-ish way. Many people in IT are familiar with the hierarchical way filesystems are organized, so it’s both an comfortable and easy-to-explain approach.

Of course there are cases, where it makes things hard to model; but you are hit that problem, you should try to choose a different approach. Building any n:m relation in the AEM content tree is counter-intuitive, hard to implement and typically not really performant.

Start with the navigation

Coming from the information architecture you typically have some idea, how the navigation in the site should look like. In the typical AEM-based site, the navigation is based on the content tree; that means that traversing the first 2-3 levels of your site tree will create the navigation (tree). If you map it the other way around, you can get from the navigation to the site tree as well.

This definition definitivly has impact on your site, as now the navigation is tied to your content structure; changing one without the other is hard. So make your decision carefully.

Consider content-reuse

As the next step consider the parts of the website, which have to be identical, e.g. header and footer. You should organize your content in a way, that these central parts are maintained once for the whole site. And that any change on them can be inherited down the content tree. When you choose this approach, it’s also very easy to implement a feature, which allows you to change that content at every level, and inherit the changed content down the tree, effectively breaking the inheritance at this point.

If you are this level, also consider the fact of dispatcher invalidation. Whenever you change such a “centralized” content, it should be easily possible to purge the dispatcher cache; in the best case the activation of the changed content will trigger the invalidation of all affected pages (not more!), assuming that you have your /statefilelevel parameter set correctly.

Consider access control

As third step let’s consider the already existing structure under the aspect of access control, which you will need on the authoring environment.
On smaller sites this topic isn’t that important, because you have only a single content team, which maintains all the page. But especially in larger organizations you have multiple teams, and each team is responsible for dedicated parts of the site.

When you design your content structure, overlay the content structure with these authoring teams, and make sure, that you can avoid any situation, where a principal has write access to a page, but not to any of the child pages. While this is not always possible, try to follow this guidelines regarding access control:

When looking from the root node in the tree to node on a lower level, always add more privileges, but do not remove them.
Every author for that site should have read access to the whole site.

If you have a very complicated ACL setup (and you’ve already failed to make it simpler), consider to change your content structure at this point, and give the ACL setup a higher focus than for example the navigation.

My advice at this point: Try to make your ACL setup very easy; the more complex it gets the more time you will spend in debugging your group and permission setup to find out, what’s going on in a certain situation; also the harder it will be to explain it to your authors.

Multi-Site with MSM

As you went now through these 3 steps, you are through with it and already have some idea how your final content structure needs to look like. There is another layer of complexity if you need to maintain multiple sites using the multi-site-manager (MSM). The MSM allows you to inherit content and content structure to another site, which is typically located in a parallel sub-tree of the overall content tree. Choosing the MSM will keep your content structures consistent, which also means, that you need to plan and setup your content master (in MSM terms it is called the blueprint) in a way, that the resulting structure is well-suited for all copies of it (in MSM: live copies).

And on top of the MSM you can add more specifics, features and requirements, which also influence the content structure of your site. But let’s finish here for the moment.

When you are done with all these exercises, you already have a solid basis and considered a lot of relevant aspects. Nevertheless you should still ask others for a second opinion. Scrutiny pays really off here, because you are likely to live with this structure for a while.

Ways to access your content with JCR (part 1)

If you are a developer and need to work with databases, you often relay on the features your framework offers you to get your work done easily. Working directly with JDBC and SQL is not really comfortable, writing “SELECT something FROM table” with lots of constraints can be tedious …

The SQL language offers only the “select” statement to retrieve data from the database. JCR offers multiple ways to actually get access to a node:

Each of these methods serve for different purposes.

session.getNode(path) is used, when you know exactly the path of a node. That’s comparable to a “select * from table where path = “/content/geometrixx/en” in SQL, which is a direct lookup of a well-known node/row.
node.getNodes() returns all child nodes below the node. This method has no equivalent in the SQL world, because in JCR there are not only distinct and independent nodes, but nodes might have a hierarchical relation.
The JCR search is the equivalent of the SQL query, it can return a set of nodes. Yes, ANSI SQL 92 is much more powerful, but let’s ignore that for this article, okay?

In ANSI SQL, approach 1 and 3 are both realized by a SELECT query, while the node.getNodes() approach has no direct equivalent. Of course it can also realized by a SELECT statement (likely resolving a 1:n relation), but it highly depends on the structure of your data.

In Jackrabbit (the reference implementation of the JCR standard and also the basis for CRX) all of these methods are implemented differently.

session.getPath(): It starts at the root node and drills down the tree. So to lookup /content/geometrixx/en the algorithm starts at the root, then looks up the node with the name “content”, then looks for a child node named “geometrixx” and then for a child node named “en”. This approach is quite efficient because each bundle (you can consider it as the implementation equivalent of a JCR node) references both its parent and all the child nodes. On every lookup the ACLs on that node are enforced. So even when a node “en” exists, but the session user does not have read access on it, it is not returned, and the algorithm stops.

node.getNodes is even more efficient, because it just has to lookup the bundles of the child node list and filter it by ACLs.

If you use the JCR search, the Lucene index is used to do the search lookup and the bundle information is used to construct the path information. The performance of this search depends on many factors, like (obviously) the amount of results returned from Lucene itself and the complexity of the other constraints.

So, as a CQ developer, you should be aware of that there are many ways to get hold of your data. And all have different performance behaviors.

In the next posting I will explain this case on a small example.

User administration on multi-client-installations

Developing an application for a multi-client-installation isn’t only a technical or engineering quest, but also reveals some question, which affect administration and organisationial processes.

To ease administration, the user accounts in CQ are often organized in a hiearchy, so that users which are placed higher in the hierarchy, can administrate user which are lower in the hierarchy tree below them. Using this mechanism a administrator can easily delegate the administration of certain users to other users, which can also do adminstrative works for “their” users.

The problem arises when a user has to have rights in 2 applications within the same CQ instance and every application should have its own “application administrator” (a child node to the superuser user). Then this kind of administration is no longer possible, because it is impossible to model a hierarchy where neither application administrator user A has a parent or child relation to application administration user B nor A and B are placed in the hierarch higher than any user C.

I assume that creating accounts for different application but the same person isn’t feasible. That would be the solution which the easiest one from an engineering point of view, but this does contradict the ongoing move not to create for each application and each user a new user/password pair (single sign on).

This problem imposes the burden of user administration (e.g assigning users to groups, resetting passwords) to the superuser, because the superuser is the user, which is always (either by transition or directly) parent to any user. (A non-CQ-based solution would be to handle user related changes like password set/reset and group assignment outside of CQ and synchronize these data then into CQ, e.g. by using a directory system based on LDAP.)

ACLs, access to templates and workflows should be assigned only using groups and roles, because these can be created per application. So if an application currently is based on a user hierarchy and individual user rights it’s hard to add a new application using the same user.

So one must make sure, that all assignments are only based on groups and roles, which are created per application. Assigning individual rights to a single user isn’t the way to go.