Why is it hard to do disk size estimations upfront?

Whenever a new project starts, a project manager is responsible for sizing the project, so that the right number of people with the right skills are assigned to it. In many projects another early task is to size the hardware. This has mostly to do with the time it takes to buy and deploy new hardware, which can be pretty long. On one project it took IT 10 months (!!) from the decision to buy 8 of “these” boxes until they were able to log in to them. And by the way, this was the regular hardware purchasing process with no special cases …

Anyway, even if it takes only 6 weeks for the whole “new hardware purchase and deployment” process, you cannot just start the project and only then determine what hardware is needed. When development starts, a development infrastructure must be provided, consisting of a reasonable number of systems with enough available resources. So one of the earliest project tasks is an initial system sizing.

If you have done it a few times for a specific type of project (for example CQ5 projects), you can give some basic sizing without doing major calculations; at that time you usually don’t have enough information to do a calculation at all. So for a centralized development system (that’s where the continuous integration server deploys to) my usual recommendation is “4 CPUs, 8-12G RAM, 100G free disk; this for 1 author and 1 publish”. This is a reasonable system to actually run development on. (Remember that each developer also has an authoring and a publishing system deployed on her laptop, where she actually tries out her code. On these central development systems all development tests are executed, as well as some integration tests.)

This gets much harder if we talk about higher environments like staging/pre-production/integration/test (or whatever you might call it) and, of course, production, because there we have much more content available. This content is the most variable factor in all calculations, because most requirement documents do not make clear how much content will be in the system in the end, how many assets will be uploaded, how often they are changed, and when they will expire and can be removed. To be honest, I would not trust any number given in such a document, because it usually changes over time, even during the very first phase of the project. So you need to be flexible regarding content and also regarding the disk space calculation.

My colleague Jayan Kandathil posted a calculation for the storage consumption of images in the datastore. It’s an interesting formula, which might be true (I haven’t validated the values), but I usually do not rely on such formulas because:

We do not only upload images to DAM, and besides the datastore we also have the TarPM and the Lucene index which contribute to the overall repository growth.
I don’t know if there will be adjustments to the “Asset update” workflow, especially if more/less/changed renditions will be created. With CQ 5.5 also any change of asset metadata will affect the datastore (XMP writeback changes the asset binary! This results in a new file in the datastore!).
I don’t know if I can rely on the numbers given in the requirements document.
There is a lot of other content in the repository, which I usually cannot estimate upfront. So the datastore consumption of the images is only a small fraction of the overall disk space consumption of the repository.

So instead of calculating the disk size based only on assumptions, I usually name a disk size to start with. This number is so high that they won’t fill it up within the first 2-3 months, but not so large that they will never ever reach 100% of it. It’s somewhere in between. So they can go live with it, and they need to monitor it. Whenever they reach 85% of the disk size, IT has to add more disk space. If you run this process for some time, you can do a pretty good forecast of the repository growth and react accordingly by attaching more disk space. I cannot do this forecast upfront, because I don’t have any reliable numbers.
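Once a few measurements exist, the monitoring model described above can feed a simple forecast. The following is only a minimal sketch, assuming the used disk space is sampled at regular intervals (e.g. once per day) and grows roughly linearly; the class and method names are hypothetical, and the 85% threshold is the one mentioned above.

```java
import java.util.List;

/** Hypothetical helper: linear forecast of repository disk usage (illustration only). */
final class DiskForecast {

    /** Average growth per sample interval (e.g. per day), taken from the first and last sample. */
    static double growthPerInterval(List<Double> usedGb) {
        if (usedGb.size() < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double first = usedGb.get(0);
        double last = usedGb.get(usedGb.size() - 1);
        return (last - first) / (usedGb.size() - 1);
    }

    /**
     * Number of intervals until usage reaches the given fraction of the disk.
     * Returns 0 if the threshold is already reached and -1 if usage is not growing.
     */
    static long intervalsUntilThreshold(List<Double> usedGb, double diskGb, double threshold) {
        double limit = diskGb * threshold;
        double current = usedGb.get(usedGb.size() - 1);
        if (current >= limit) {
            return 0;
        }
        double growth = growthPerInterval(usedGb);
        if (growth <= 0) {
            return -1;
        }
        return (long) Math.ceil((limit - current) / growth);
    }
}
```

With daily samples of 40, 42, 44 and 46 GB on a 100 GB disk, this sketch reports 20 days until the 85% mark is reached. Real growth is rarely that linear, so the forecast should be recalculated whenever new samples arrive.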

So, my learning from this: I don’t spend that much time on disk calculations upfront. I only give the customer a model, and based on this model they can react and attach storage in a timely manner. This is also the cheapest approach, because you attach storage only when it’s really needed and not based on some unreliable calculation.

Meta: What happened in 2012

2012 was a very successful CQ5-year for me (and I hope also for you). The CQ 5.5 release early in 2012 brought interesting new features (most notably: the new hotfix policy) and also interesting projects. I worked with some great people on these projects.

I also “revived” this blog with a number of new blog entries, which seem pretty popular. Because of you! So many thanks to all of my readers. A wide audience (and feedback) keeps me motivated to blog about more topics regarding CQ5.

In 2012 I wasn’t able to attend any conference. I will change this for 2013, so maybe you can meet me in person, but no plans have been made yet.

So, I wish you and your families a happy and successful year 2013.

Ways to access your content with JCR (part 2): Performance aspects

In the previous post I described the ways in which you can access your data in JCR. I also showed that these ways perform differently.

For the direct lookup of a node, the complexity depends on the number of path elements which need to be traversed from the root node to that node. The number of child nodes on each of these levels also has an impact. But in general this lookup is pretty fast.
If you just iterate through child nodes (using node.getNodes()), it’s even faster; the lookup complexity is constant.
For the JCR search, the third approach, no general estimate can be given; it depends too much on the query.

First, the JCR query consists of 2 parts: An index lookup and operations on the JCR bundles.

Note: Of course you can build queries where an index lookup is not required and which might be optimized by the query engine; for example “/jcr:root/content/geometrixx/*” would return all nodes directly below /content/geometrixx. But building such queries isn’t useful at all, and I consider them a misuse of JCR queries.

This combination usually works in such a way that the index lookup produces a set of possible results, which are then filtered by the means of JCR, e.g. by applying path constraints or node type restrictions. In every case the ACLs are taken into account.

Let’s consider this simple example:

/jcr:root/content/geometrixx/en//*[jcr:contains(., 'support')]

First, it looks up all properties for the search term “support”. As the backing system for JCR search is Apache Lucene, and Lucene is implemented as an inverted index, direct lookups like this are extremely efficient.
Then the path is calculated for all results. This means that for each result item the parent is looked up recursively up to the root node. In that process the ACL checks are performed.

As soon as the query gets complicated and Lucene delivers many results (for example because you are looking for wildcards) or you do complex JCR-based operations in the query, this isn’t that easy and performant any more. The more nodes you need to load to execute a query (and for all path and ACLs evaluations you need to load the bundle from disk to your BundleCache) the more time it takes.

But if you traverse a subtree with node.getNodes(), only the bundles of that subtree are loaded into the BundleCache for evaluation.

So in many cases, especially when you need to search a small subtree for a specific node, it’s more efficient to manually traverse the tree and search for the node(s) than to use JCR search. This means you use the other 2 approaches listed above. You might not be used to this if you have worked with relational databases for years, but it is a very feasible way with the possibility of huge performance benefits.
So, give it a try. But don’t expect differences on your developer machine with a blazing fast SSD and 1 gigabyte repository size. Test it on your production-size repository!
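To illustrate the manual traversal, here is a minimal sketch of a depth-first search through a small subtree. It is deliberately written against a generic “children of” function instead of the JCR API, so that it runs without a repository; the class and method names are hypothetical. With JCR, that function would be backed by node.getNodes(), and the RepositoryException handling (omitted here) would have to be added.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical helper: search a small subtree by traversal instead of a query. */
final class SubtreeSearch {

    /**
     * Walks the tree below root (depth-first) and returns the first element
     * matching the predicate. childrenOf supplies the children of an element;
     * with JCR this would wrap Node.getNodes().
     */
    static <T> Optional<T> findFirst(T root,
                                     Function<T, Iterable<T>> childrenOf,
                                     Predicate<T> matches) {
        Deque<T> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            T current = stack.pop();
            if (matches.test(current)) {
                return Optional.of(current);
            }
            // only the children of visited elements are ever loaded
            for (T child : childrenOf.apply(current)) {
                stack.push(child);
            }
        }
        return Optional.empty();
    }
}
```

Only the elements that are actually visited get loaded, which is the reason why this approach can beat a query on a small subtree. On a large subtree the picture may look different, so measure both variants.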

Ways to access your content with JCR (part 1)

If you are a developer and need to work with databases, you often rely on the features your framework offers you to get your work done easily. Working directly with JDBC and SQL is not really comfortable, and writing “SELECT something FROM table” with lots of constraints can be tedious …

The SQL language offers only the “select” statement to retrieve data from the database. JCR offers multiple ways to actually get access to a node:

session.getNode(path)
node.getNodes()
the JCR search

Each of these methods serves a different purpose.

session.getNode(path) is used when you know the exact path of a node. That’s comparable to a “select * from table where path = '/content/geometrixx/en'” in SQL, which is a direct lookup of a well-known node/row.
node.getNodes() returns all child nodes below the node. This method has no equivalent in the SQL world, because in JCR there are not only distinct and independent nodes, but nodes might have a hierarchical relation.
The JCR search is the equivalent of the SQL query, it can return a set of nodes. Yes, ANSI SQL 92 is much more powerful, but let’s ignore that for this article, okay?

In ANSI SQL, approaches 1 and 3 are both realized by a SELECT query, while the node.getNodes() approach has no direct equivalent. Of course it can also be realized by a SELECT statement (likely resolving a 1:n relation), but that highly depends on the structure of your data.

In Jackrabbit (the reference implementation of the JCR standard and also the basis for CRX) all of these methods are implemented differently.

session.getNode(path): It starts at the root node and drills down the tree. So to look up /content/geometrixx/en the algorithm starts at the root, then looks up the node with the name “content”, then looks for a child node named “geometrixx” and then for a child node named “en”. This approach is quite efficient because each bundle (you can consider it as the implementation equivalent of a JCR node) references both its parent and all the child nodes. On every lookup the ACLs on that node are enforced. So even when a node “en” exists, but the session user does not have read access to it, it is not returned, and the algorithm stops.

node.getNodes() is even more efficient, because it just has to look up the bundles of the child node list and filter them by ACLs.

If you use the JCR search, the Lucene index is used to do the search lookup and the bundle information is used to construct the path information. The performance of this search depends on many factors, like (obviously) the amount of results returned from Lucene itself and the complexity of the other constraints.

So, as a CQ developer, you should be aware that there are many ways to get hold of your data, and all of them have different performance behaviors.

In the next posting I will explain this using a small example.

CQ coding patterns: Sling vs JCR (part 3)

In the last 2 postings (part 1, part 2) of the “Sling vs JCR” shootout I only handled read cases, but as Ryan D Lunka pointed out in the comments on part 1, writing to the repository is the harder part. I agree with that. While on reading you can often streamline the process of reading nodes and properties and ignore some bits and pieces (e.g. mixins, nodeTypes), on the write side this is very important information, because when writing you define the layout and types, so you need to have full control over them. And the JCR API is very good at giving you this control, so there’s not much room for simplification by a Sling solution.

Indeed, in the past there was no “Resource.save()” method or anything like it. Essentially the resourceResolver (as a rough equivalent of the JCR session) was only designed for retrieving and reading information, but not for writing. There are some front ends like the Sling POST servlet, which directly access the JCR repository to persist changes. But these are only workarounds for the fact that there is no real write support on the Sling Resource level.

In recent months there has been some activity in that area. There is a wiki page for a design discussion about supporting CRUD operations within Sling, and code has already been submitted. According to Carsten Ziegler, the submitter of this Sling issue, this feature should be available in the upcoming version of CQ. So let’s see what this will really look like.

But till then you have 2 options to get your data into the CRX repository:

Use the SlingPostServlet: Especially if you want to create or change data in the repository remotely, this servlet is your guy. It is flexible enough to cater for most normal use cases. For example most (if not all) of the CQ dialogs and actions use this SlingPostServlet to persist their data in the repository.
If that’s not sufficient for you, or you need to write structures directly from within code deployed within CQ, use JCR.

So here Sling currently does not bring any benefit if your only job is to save data from within OSGi to CQ. In many cases the Default POST Servlet might be sufficient when you need to update data in CRX from outside. And let’s see what CQ 5.6 brings with it.

CQ coding patterns: Sling vs JCR (part 2)

In the last posting I showed the benefits of Sling over plain JCR regarding resource access. But the two frameworks offer similar functionality not only in resource access, but also in the important area of listening to changes in the repository. So today I want to compare JCR observation to Sling eventing.

JCR observation is a part of the JCR specification and is a very easy way to listen for changes in the repository.

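import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;

import org.apache.felix.scr.annotations.Activate;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Deactivate;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.jcr.api.SlingRepository;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Component(immediate = true, metatype = false)
@Service
public class Listener implements EventListener {

  @Reference
  private SlingRepository repo;

  private Session session;
  private final Logger log = LoggerFactory.getLogger(Listener.class);

  @Activate
  protected void activate() {
    try {
      // use the admin session only to impersonate the less privileged user "author"
      Session adminSession = repo.loginAdministrative(null);
      try {
        session = adminSession.impersonate(new SimpleCredentials("author", new char[0]));
      } finally {
        adminSession.logout();
      }
      session.getWorkspace().getObservationManager().addEventListener(this, // listener
        Event.NODE_ADDED | Event.NODE_REMOVED | Event.NODE_MOVED, // eventTypes
        "/", // absPath
        true, // isDeep
        null, // uuid
        null, // nodeTypeNames
        true // noLocal
      );
    } catch (RepositoryException e) {
      log.error("Error while registering observation", e);
    }
  }

  @Deactivate
  protected void deactivate() {
    if (session == null) {
      return;
    }
    try {
      session.getWorkspace().getObservationManager().removeEventListener(this);
    } catch (RepositoryException e) {
      log.error("Error while unregistering observation", e);
    } finally {
      session.logout();
      session = null;
    }
  }

  public void onEvent(EventIterator events) {
    while (events.hasNext()) {
      Event e = events.nextEvent();
      // do your event handling here
    }
  }
}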

In JCR the creation of an observation listener is straightforward, and so is the event listening. The observation process is tied to the lifetime of the session, which is started and stopped at activation/deactivation of this Sling service. This kind of implementation is a common pattern.
Note: Be aware of the implications of using an admin session!

You can read and write everywhere in the repository
You work with elevated rights you probably never need.

So try to avoid an adminSession. In the example above I use a session owned by the user “author” for this, obtained via impersonation.

A sling implementation of the same could look like this:

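import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.osgi.service.event.Event;
import org.osgi.service.event.EventHandler;

@Component(immediate = true)
@Service
@Property(name = "event.topics", value = "org/apache/sling/api/resource/Resource/*")
public class Listener implements EventHandler {

  public void handleEvent(Event event) {
    // handle
  }
}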

You see that we don’t have to care about registering a listener and managing a session. Instead we just subscribe to some events emitted by the Sling Eventing framework. This framework essentially implements the OSGi event specification, and therefore you can subscribe to various other event topics in the very same way. You can check the “Event” tab in the Felix console, where you can see a list of events which just happened in your CQ.

Some examples of topics:

BundleEvents (the complete OSGi bundle life cycle is reflected here)
CQ Replication
CQ Workflow
Sling Jobs
Sling Resources

One disadvantage of Sling eventing is, that you cannot restrict the resource events to a certain path or user. Instead Sling registers with an admin session and publishes all observation events starting at the root node of the repository (/). You should filter quite early in your handleEvent routine for the events you are really interested in.

But: With these events you don’t get hold of a resource. You always have to create it yourself using this pattern:

String path = (String) event.getProperty(SlingConstants.PROPERTY_PATH);
if (path != null) {
  Resource resource = resourceResolver.getResource(path);
}

Also there you need to take care, that you’re using an appropriate ResourceResolver (try to avoid an admin resource resolver as well).

Also JCR observation has some benefits, if you want or need to limit the events you receive more granularly, for example by listening to changes of specific nodeTypes only or mixins.

So, as a conclusion: If you need to restrict the amount of repository events you want to process directly on API level, JCR observation is the right thing for you. Otherwise go for Sling Eventing, as it offers you also the possibility to receive many more types of events. For non-repository events it’s the only way to listen for them.

CQ5 coding patterns: Sling vs JCR (part 1)

CQ5 as a complex framework is built on top of various other frameworks; on the server side the most notable ones are JCR (with its implementation Apache Jackrabbit) and Apache Sling. Both are very powerful frameworks, but they have some overlap in functionality:

reading data from the repository (Sling Resources vs JCR nodes/properties)
notification on repository events (Sling Eventing vs JCR observation)

In these 2 areas you can work with both frameworks, and achieve good results with both. So, the question is: in what situation should you prefer Sling and in what situation pure JCR?

First, and I hope you agree here, in 99% of all cases using an abstracted framework is recommended over using a concrete technology (everybody uses JDBC and no one a direct database interface for e.g. MySQL). It usually offers more flexibility and an easier learning curve. Same here. While on pure JCR you only work with raw repository structures (nodes and properties), the Sling resource abstraction offers you easier handling (no need to deal with the repository exceptions any more) and many more options to interact with your business objects and services.

As an example let’s assume, that we need to read the title of CQ page /content/geometrixx/en/services. Using the JCR API it would look like this:

import javax.jcr.Node;
import javax.jcr.Property;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

String readTitle (Session session) throws RepositoryException {
  Node page = session.getNode("/content/geometrixx/en/services");
  Node jcrcontent = page.getNode("jcr:content");
  Property titleProp = jcrcontent.getProperty("jcr:title");
  String title = titleProp.getValue().getString();
  return title;
}

This is a very small example, but it shows 3 problems:

We need to know that all CQ page properties are stored on the jcr:content node below the page.
We need to know that the title of the page is stored in the property “jcr:title”.
Properties are not available immediately as basic types; they need to be converted from a Property to a Value and then to our expected type (String).

The same example with Sling:

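import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import com.day.cq.wcm.api.Page;

String readTitle (ResourceResolver resolver) {
  Resource r = resolver.getResource("/content/geometrixx/en/services");
  Page page = r.adaptTo(Page.class);
  String title = page.getTitle();
  return title;
}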

We don’t need to deal with the low-level information (like the jcr:content node) and properties, but we use the appropriate business object (a CQ page object), which is available out of the box and offers a better level of abstraction.

On a Sling resource level we also have a bunch of helpers available, which offer some great benefits over the use of plain nodes and properties:

* the adaptTo() mechanism allows to convert a resource into appropriate objects representing a certain aspect of this resource, for example:
```
LiveCopyStatus lcs = resource.adaptTo (LiveCopyStatus.class);
PageManager pm = resource.adaptTo (PageManager.class);
```
  see http://localhost:4502/system/console/adapters for a (incomplete?) list.
* The ValueMap is abstracting away the Property type, type conversions are then done implicitly.
* And likely many many more.
* And if you ever need to deal with JCR directly, just use
```
Node node = resource.adaptTo(Node.class);
```

So, there are many reasons to use Sling instead of the JCR API.

CQ Exams

In the past months Adobe has released 2 exams for CQ5 developers:

CQ 5.5 component developer exam (although the name might imply it, this exam does not only cover the development of components and templates!)
CQ 5.5 lead developer exam

Both exams cover a broad range of topics regarding CQ5 and its building blocks. Surprisingly for most developers, sysadmin-related questions are asked as well, like details of the dispatcher configuration. The topic lists in the referenced PDFs are quite comprehensive and give you a good overview of which areas are covered by the questions. Be prepared to answer some questions down to the property level… And just as a warning: you need practical experience with CQ5; from my point of view you cannot pass just by having attended all trainings!

I wish you all, who want to take that exam, good luck. It’s doable.

“Concurrent users” and performance tests

A few years ago, when I was still working in application management for a large website, we often had the case that the system was reported to be slow. Whenever we looked into the system with our tooling we did not find anything useful, and we weren’t able to explain this slowness. We had logs which confirmed the slowness, but no apparent reason for it. Sadly the definition of performance metrics was just … well, forgotten. (I once saw the performance requirement section in the huge functional specification doc: 3 sentences vs 200 pages of other requirements.)

It was a pretty large system, and rumor had it that it was one of the biggest installations of this application. So we approached the vendor and asked how many parallel users the system supports. Their standard answer, “30” (btw: that’s still the number on their website, although they have rewritten the software from scratch since then), wasn’t that satisfying, because they didn’t provide any way to actually measure this value on the production system.

The situation then improved a bit, got worse again, improved, … and so on. We had some escalations in the meantime and also ran in task force mode for months to fight this and other performance problems. Until I finally got mad, because we weren’t able to actually measure how the system was used. Then I started to define the meaning of “concurrent users” for myself: “2 users are considered concurrent users when for each one a request is logged in the same timeframe of 15 minutes”. I wrote a small Perl script which ran through the web server logs and calculated these numbers for me. As a result I had 4 numbers of concurrent users per hour. By far not exact, but reasonable enough that we at least had some numbers.
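The counting idea behind that definition can be sketched in a few lines. This is not the original Perl script, just a hypothetical re-creation in Java; it assumes the log lines have already been parsed into a user name and a timestamp in epoch seconds, and it uses fixed 15-minute windows, which yields the 4 numbers per hour mentioned above.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical re-creation of the counting idea; not the original script. */
final class ConcurrentUsers {

    static final long WINDOW_SECONDS = 15 * 60;

    /** One parsed access log line: who issued a request and when (epoch seconds). */
    static final class Request {
        final String user;
        final long epochSeconds;

        Request(String user, long epochSeconds) {
            this.user = user;
            this.epochSeconds = epochSeconds;
        }
    }

    /** Maps the start of each 15-minute window to the number of distinct users seen in it. */
    static Map<Long, Integer> countPerWindow(Iterable<Request> requests) {
        Map<Long, Set<String>> usersPerWindow = new TreeMap<>();
        for (Request r : requests) {
            long windowStart = (r.epochSeconds / WINDOW_SECONDS) * WINDOW_SECONDS;
            usersPerWindow.computeIfAbsent(windowStart, k -> new HashSet<>()).add(r.user);
        }
        Map<Long, Integer> counts = new TreeMap<>();
        for (Map.Entry<Long, Set<String>> e : usersPerWindow.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }
}
```

Note that a single request is enough to count a user for a whole window, so a browser tab with an auto-refresh is counted just like a user who is actively working.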

And at that time I also learned that managers are not interested in the definition of a term, as long as they think they know what it means. So I actually counted a user as concurrent when she had logged in once and then had an auto-refresh of a page every 10 minutes configured in her web browser. But hey, no one questioned my definition, and I think the scripts with that built-in definition are still used today.

But now we were able to actually compare the reported performance problems against these numbers. And we found out that they were sometimes related (my reports showed 60 concurrent users while the system was reported to be slow), but often not (no performance problems reported while my reports showed 80 concurrent users; and also performance problems with 30 reported users). So, this definition was actually useless… Or maybe the performance isn’t related to the “concurrent users” at all? I already suspected that, but wasn’t able to dig deeper and improve the definition and the scripts.

(Oh, before you think “count the active sessions on the system, dude!”: That system didn’t have any server-side state, therefore no sessions. And the above-mentioned auto-reload of the start page by a logged-in user leads to the same result: she has an active session. So be careful.)

So, whenever you get a requirement like “the system should support X concurrent users with a maximum response time of 2 seconds”, question “concurrent”. Define the activities of these users, build corresponding performance tests and validate the performance. Have tools to actually measure the metrics of a system hammered with “X concurrent users” during these performance tests. Apply the same tooling to production. If the metrics deliver the same values: cool, your tests were good. If not, something’s wrong: either your tests or the reality…

So as a bottom line: Non-functional requirements such as performance should be defined with the same attention as functional requirements. In any other case you will run into problems. And

Meta: upcoming events 2012

A new posting in which I don’t tell you anything about CQ 5.5, but about some upcoming events. These might be relevant for you, because both are about CQ5.

First, there is the adaptTo() conference in Berlin, Germany from September 26th to 28th. Last year this conference was a huge success, and I got very positive feedback from colleagues and coworkers about it. Some people from CQ5 engineering will be attending this 3-day conference as well, so expect cool technical topics and a lot of Q&A sessions (official and unofficial ones). As usual the get-together will be the most interesting part. So if you are working in the CQ5 business and you are interested in more technical topics: this is the event especially for you!
Sadly I cannot make it this year, because my schedule is too tight.

And as a second event there is “Evolve12” in San Diego, California (from Oct 15th to 17th). As this conference is the first one, I don’t have too much information about its character, but having Kevin Cochrane and David Nüscheler on the list of speakers, this will be a little bit more strategic and marketing oriented 🙂 Although 2 of the 3 tracks claim to be technical … well, let’s see.
This event is accompanied by a special request on the google groups list (which is btw a must for every CQ5 developer, check http://groups.google.com/group/day-communique?hl=en): Everybody should nominate her personal “TOP CQ Google Group Contributor”, and I encourage you to do so.
Currently I don’t plan to attend this conference either, as it would mean being away for at least a week (and again, my work schedule is tight). But let’s see…