The future of CQ Healthcheck

A few days ago Sling Healthcheck 1.0 was released. My colleague Bertrand Delacretaz initiated and pushed that project, and he did a great job. And now there's a monitoring solution where it belongs: in the Sling framework, at Apache, with a much greater visibility than my personal pet project would ever get. I don't have the possibility to spend much time on it, and in fact I never wanted to run such a thing on my own. Founding the CQ healthcheck project was necessary to push the topic "monitoring" and to make it visible. And now I am glad that Bertrand picked it up. I fully trust that he will push it on the Sling side and also inside Adobe, so we can expect to see the healthcheck functionality in the next CQ release. And that's exactly what I always wanted to achieve: a usable and extensible monitoring solution available out of the box in CQ.

So I think that the mission of the CQ healthcheck project is over, and I will discontinue the development on GitHub. I will leave the code there, and you can still fork it and restart the development.

Meta: AdaptTo 2013

As you might have seen, I took a small break from publishing new articles this summer. I hope that will change soon; I have some interesting topics in mind.

I will attend this year's AdaptTo event in Berlin. I am looking forward to meeting some of my colleagues (some I know, some I don't know in person yet) and all the people from the community, partners, Adobe customers, etc. I hope I can meet you there as well 🙂 So see you in Berlin.

JVM tuning or premature optimization?

JVM tuning is sometimes a big topic on projects. Depending on the experience people have made on previous projects, they apply a whole bunch of parameters to the JVM right away, before they even run the application for the first time. Or they just don't know that there are parameters beyond -Xmx and -Xms. So, which way should you go? Before I give my personal recommendation, let me explain a few things first.

JVM tuning (which in most cases means only "garbage collection optimization") is a horribly complex topic, for a few reasons:

  • All major JVMs (Sun/Oracle Java, IBM Java, and let's also add OpenJDK) have evolved over a long time, and many man-years of research went into them to make them efficient, to an extent that they sometimes even beat compiled code. For this the engineers implemented hotspot compilation, code analyzers and optimizers, and a lot more interesting stuff. Also, the garbage collection algorithms have improved vastly over time. In the end I don't think that anyone outside the core engineering team of such a JVM understands the full impact of the more complicated JVM switches.
  • Most JVMs have a lot of auto-tuning parameters, which adapt themselves to a certain extent to the application. A lot of heuristics and statistics have been built in to avoid the need for major tuning. (Of course, if this auto-tuning is working against you, you should start tweaking parameters yourself.)
  • When switching over to a new version of the JVM, no one guarantees that the set of JVM parameters you applied for the previous version will continue to work smoothly. In most cases these switches are not really well documented (probably on purpose), and they are not guaranteed to stay effective.
  • And especially in the context of garbage collection tuning: you tune for a specific application and for its specific behavior. Some applications have a lot of long-living objects and high concurrency in accessing these objects. Others create a lot of short-living objects in a thread, dispose them after some task has been performed, and create no or only a very small number of long-living objects. For these two extreme scenarios garbage collection will behave completely differently, and when you want or need to optimize it, you need to optimize differently and for different goals. In a batch scenario you optimize for high throughput, while in an interactive scenario you optimize for low latency (and sacrifice a little bit of throughput for it).

So, just because you have a proven set of parameters from a different project, you cannot blindly apply it. In the best case it might help a little bit, but you don't know why. In the normal case these switches just don't have a measurable impact on performance. And in the worst case you degrade your performance, and you also don't know why (but you have the feeling that you have done it right).

So, what are my recommendations for JVM tuning for CQ5 applications:

  1. Do not try to optimize the JVM at all. In the majority of CQ installations I've worked with, no optimization has been performed and the JVM behaved very well. Analysis showed no problems with the JVM and garbage collection, so there was no need to change anything there.
  2. Have enough information to analyze the situation in case you have a problem. I usually add a few informational parameters to the JVM, such as garbage collection logging, creating a heap dump on out-of-memory errors, and things like that. These are not tuning parameters in the strict sense, but they give me more information about the JVM itself. Normally I don't need them, but in case of problems it is good to have this information.
  3. And if you have a problem which you track down to the JVM itself, ask an expert on that topic. Just googling and blindly adding parameters to the JVM isn't helpful if you don't know what a parameter changes at all and whether it is relevant for you. And then don't forget to prove that your change has a positive effect.

My recommended JVM parameters for Oracle JVM (1.6) besides the default ones specified by the CQ start script:

  • -XX:+HeapDumpOnOutOfMemoryError — write a heap dump if the JVM runs out of heap space
  • -XX:+HeapDumpOnCtrlBreak — on Windows, create a heap dump when pressing Ctrl+Break
  • -verbose:gc — verbose garbage collection logging
  • -XX:+PrintGCTimeStamps — print timestamps in the GC log
  • -XX:+PrintGCDetails — even more GC information
  • -Xloggc:gc.log — write the GC logs to this file and not to stdout
  • -XX:+UseGCLogFileRotation (since Java 6u32 or Java 7u2) — enable GC log rotation
  • -XX:NumberOfGCLogFiles=10 (since Java 6u32 or Java 7u2) — and keep 10 versions of the GC log file

To understand how the memory model of modern JVMs works (especially the generational heap model of the Oracle JVM), please read this great article on Java Garbage Collection.

While I appreciate the work my colleague Jayan Kandathil did (he posted some settings he benchmarked to be most effective for his specific type of workload), I would not recommend that you just blindly apply these settings to your own instance, because very likely your workload will be a bit different.

CQ development patterns – JCR observation

As mentioned in earlier postings, JCR observation is a very cool concept, as it allows you to react on changes in the repository; this allows you to change any written data after the write in a centralized manner, instead of attaching a post-processing step to each and every method before it saves to the repository.

A good example for this are the CQ DAM Asset Update workflows. There are many ways to ingest assets into your CQ: direct upload via browser, WebDAV, sync from Creative Cloud, uploads via your own bulk ingestion methods, the Sling POST servlet, … And instead of attaching the metadata extraction, the rendition generation and all the other steps to each of these methods, a single centralized process takes care of it. It doesn't matter that this process is asynchronous to the ingestion itself. In this case it's a workflow, but the workflow triggers are also directly based on JCR observation.

So JCR observation is a powerful concept, and is widely used for many different purposes.

But as it is a very fundamental concept and the basis for a lot of features within CQ, you can also do a lot of harm there. So I want to discuss the most important characteristic of JCR observation.

JCR observation is single-threaded

In Jackrabbit 1 and Jackrabbit 2, JCR observation is a single-threaded mechanism. For every event each registered listener is evaluated, and if there's a match, the onEvent method of that listener is called by that single thread.

Recommendation 1
The code executed in that onEvent method should be really fast: no long-running calculations, no network access.

Recommendation 2
If you have a long-running calculation or if you need to do network access, do it in a dedicated thread and run it asynchronously to the JCR observation.
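
A minimal sketch of such an asynchronous listener, just to illustrate the idea (class and method names are made up, and the actual work is only indicated by a comment):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import javax.jcr.RepositoryException;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;

public class AsyncEventListener implements EventListener {

  // a single background thread which performs the heavy lifting
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  @Override
  public void onEvent(EventIterator events) {
    // only collect the event paths here; this part must return quickly
    final List<String> paths = new ArrayList<String>();
    while (events.hasNext()) {
      Event event = events.nextEvent();
      try {
        paths.add(event.getPath());
      } catch (RepositoryException e) {
        // log and continue
      }
    }
    // the expensive part (calculations, network access) runs asynchronously
    executor.submit(new Runnable() {
      public void run() {
        for (String path : paths) {
          // do the long-running work for this path here
        }
      }
    });
  }
}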

Recommendation 3
Be careful when your event listener is writing to the repository as well. If there's already a heavy write load, your event listener might block when it comes to writing to the repository. This will add additional delay to the JCR observation processing.

Recommendation 4
If your JCR observation handler needs to be active only on certain instances, make the registration process configurable. Based on run mode or OSGi configuration, do not do the "addEventListener" call at all. This avoids even the small overhead of executing an event listener which is not required on this instance.
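
A sketch of how such a conditional registration might look (assuming Felix SCR annotations and an administrative session; the property name "listener.enabled" is just an illustrative assumption):

import java.util.Map;

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;

import org.apache.felix.scr.annotations.Activate;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Deactivate;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Reference;
import org.apache.sling.commons.osgi.PropertiesUtil;
import org.apache.sling.jcr.api.SlingRepository;

@Component(metatype = true)
public class ConditionalListener implements EventListener {

  // hypothetical OSGi property to switch the listener on or off per instance
  @Property(boolValue = false)
  static final String ENABLED = "listener.enabled";

  @Reference
  SlingRepository repo;

  Session session;

  @Activate
  protected void activate(Map<String, Object> config) throws RepositoryException {
    if (!PropertiesUtil.toBoolean(config.get(ENABLED), false)) {
      // not required on this instance, so don't register at all
      return;
    }
    session = repo.loginAdministrative(null);
    session.getWorkspace().getObservationManager().addEventListener(
        this, Event.NODE_ADDED, "/content", true, null, null, true);
  }

  @Deactivate
  protected void deactivate() throws RepositoryException {
    if (session != null && session.isLive()) {
      session.getWorkspace().getObservationManager().removeEventListener(this);
      session.logout();
    }
  }

  public void onEvent(EventIterator events) {
    // react on the events
  }
}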

Recommendation 5
If you get the WARN statement in your log files that you have more than 200k pending events in the observation queue, your alarm bells should ring. In that case all save() operations are delayed until the queue goes below that limit again (see http://www.docjar.com/html/api/org/apache/jackrabbit/core/observation/ObservationDispatcher.java.html); if you encounter this message, your first priority should be to analyze your observation listeners and speed them up.

Recommendation 6
If you want to know which observation listener consumes how much time, you should set the log level of the class "org.apache.jackrabbit.core.observation.EventConsumer" to DEBUG; it will then print, for each event (!), which listener consumed how much time.

So, as a conclusion: if you use JCR observation, be aware that the code you execute there must be fast. Everything which exceeds a few milliseconds should be offloaded to a separate thread. And check that you run only the observation listeners which are really required on that instance.

CQ development patterns – Sling ResourceResolver and JCR sessions

When you work with CQ, you will find Resources and (to a lesser extent) Nodes everywhere. Many services work on the JCR repository and use the basic JCR API with its nodes and properties abstraction. Others work with Sling Resources and high level objects like Pages and DAM assets.

So, unless you are restricted to the presentation layer and component implementations, you should know the basics of how to deal with JCR sessions and the Sling ResourceResolver, because that's the basic tooling you should be able to handle. And it isn't that hard to work with; in "Sling vs JCR" I already described how you can access data stored in the repository and work with it, both on the JCR and the Sling level.

Some words on the Sling resource resolver: although the notion of Sling resources goes beyond the scope of a JCR repository, in many CQ projects we have a 1:1 match between a Sling resource and a JCR node, simply because the JcrResourceProvider is the one mounted at the root of your Sling resource tree. In that case, whenever you open a ResourceResolver, there is a 100% chance that you will open a JCR session as well behind the scenes. And for the same reason you should call "close()" on every resource resolver when you don't need it anymore: internally the corresponding session is logged out then. For these reasons many patterns we have for JCR sessions can also be applied to Sling ResourceResolvers.

So today I want to present some patterns or best-practices how you should work with Sling resource resolvers and JCR sessions.

1st rule: Session leaks = memory leaks
Whenever you open a JCR session, the JCR repository object holds a reference to it; this link prevents the session object from being garbage collected, even if you drop all references to that session object. Every session will consume a little bit of memory unless the logout() method is called explicitly. If you omit this call and create lots of sessions, you risk an out-of-memory error in your JVM, which can terminate the CQ instance. A single leaked session isn't a problem, but if you have hundreds or thousands of leaked sessions, it might turn into one. Gladly you will find lots of warnings in CQ's error.log, so it isn't that hard to spot them.

A simple pattern to avoid such a situation is to limit the scope of a JCR session to the scope of a Java method or an OSGi service/component: create the session at the beginning and terminate it at the end.

In a method scope this could look like this:

@Reference
SlingRepository repo;

private void doSomething() {
  Session session = null;
  try {
    session = repo.loginAdministrative(null);
    ...
  } catch (RepositoryException e) {
    // log the exception
  } finally {
    if (session != null && session.isLive()) {
      session.logout();
    }
  }
}

This pattern will always clean up the session and is therefore guaranteed not to leak any memory due to sessions.

In many cases you need a single session during the lifetime of a service or component. If you have that requirement, you can use this pattern:

@Component(…)
public class UsesSession {

  @Reference
  SlingRepository repo;

  Session session;

  @Activate
  protected void activate() {
    session = null;
    try {
      session = repo.loginAdministrative(null);
      …
    } catch (RepositoryException e) {
      // log the exception
    }
  }

  @Deactivate
  protected void deactivate() {
    if (session != null && session.isLive()) {
      session.logout();
    }
  }

}

This approach makes sure that after the component is stopped the session is closed, and no session leak occurs.
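
The same pattern applies to the Sling ResourceResolver; a minimal sketch in a method scope (imports omitted as in the snippets above, and assuming that an administrative resolver is acceptable here) might look like this:

@Reference
ResourceResolverFactory resolverFactory;

private void doSomethingWithResources() {
  ResourceResolver resolver = null;
  try {
    resolver = resolverFactory.getAdministrativeResourceResolver(null);
    // work with the resolver here
  } catch (LoginException e) {
    // log the exception
  } finally {
    if (resolver != null && resolver.isLive()) {
      resolver.close();
    }
  }
}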

2nd pattern: Ownership: You open it — you close it.
The basic rule of ownership is: if you open a JCR session or a Sling ResourceResolver, you are also responsible for closing it. On the other hand: if you get passed a ResourceResolver or a Session object, do not call close() or logout() on it. This works closely together with the 1st pattern, as it makes sure that only a single instance is responsible for the lifecycle operations of a session or a resource resolver.

3rd rule: Prefer the ResourceResolver over the JCR Session.
The Sling resource resolver is a very powerful concept, and in many cases you should prefer it over the JCR session; I already wrote about this topic in "JCR vs Sling", so let me just repeat the main advantage in short: adaptTo() to high-level business concepts and objects.
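
For example, getting from a resource to a Page could look like this (a hedged sketch; it assumes you already have an open ResourceResolver named resolver, and the path is purely illustrative):

Resource resource = resolver.getResource("/content/mysite/en");
if (resource != null) {
  Page page = resource.adaptTo(Page.class);
  if (page != null) {
    String title = page.getTitle();
    // work with the Page object instead of raw nodes and properties
  }
}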

4th rule: Do not share nodes, resources and pages between threads and requests.
You might run into the case that you need to share data between multiple threads or requests. A simple example for this is an HTTP session or a cache which stores data you need to share between multiple or all incoming requests.
In such cases do not store nodes, pages, DAM assets or any other object based on a node or a resource in that cache. Each of these objects belongs to a certain session (or resource resolver instance), and they are not valid anymore once this session is logged out or the resource resolver is closed; if you still try to execute methods on them, you will run into exceptions.

If you want to share such data, share objects which are not bound to any session. So instead of putting pages in there, put the path of the page there instead. If you need the page again, just use your own ResourceResolver, resolve the path to a resource and adapt it to a page. You can still run into access control issues if the second resource resolver does not see that node due to ACL constraints; in such a case you should think about your application design.
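
A small sketch of this approach (the cache and the second resolver are hypothetical placeholders, just to show the idea):

// store only the path, not the Page object itself
cache.put("homepage", page.getPath());

// later, in another request or thread, re-resolve it with your own ResourceResolver
String path = (String) cache.get("homepage");
Resource resource = myResolver.getResource(path);
Page sharedPage = (resource != null) ? resource.adaptTo(Page.class) : null;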

Update (2018-10-26): 5th rule: Adapting a ResourceResolver to a session object does not create a new session!

When you have an already open ResourceResolver, adapting it to Session.class will just expose the internally used JCR session; it will not create a new session. Therefore you must not close this session.
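
In code this looks like this (sketch, again assuming an open ResourceResolver named resolver):

// exposes the resolver's internal JCR session; do not call logout() on it
Session session = resolver.adaptTo(Session.class);
// use the session for JCR-level operations, but leave its lifecycle to the
// resource resolver: resolver.close() will log the session out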

Performance tests (5): Environmental factors

If you are doing sports like cycling or running, you know about the importance of environmental factors. If you run a championship under severe conditions like heavy rain, you know upfront that a new world record is very unlikely; the performance of all runners is reduced by the environmental conditions. On the other hand, if you want to swim across the English Channel, it is not sufficient to train only in your indoor swimming pool; you need to train under more severe conditions.

It's the same with performance tests: they are impacted by the environment of the systems they run on. If your test needs to copy a large amount of data to another system via the network, the performance highly depends on the available bandwidth and the latency between these systems. If you compare two runs of this test, one on a loaded network connection (another application is also copying data) and one with an unloaded connection, you'll get a huge gap, even though you ran the same test case on the same systems. So the environment also influences our tests and the test results.

So we always have to consider the impact of environmental factors. These factors can occur anywhere at any time, and we cannot influence them directly.

  • Other applications add additional load to our systems, so we don't have the same amount of resources available for the test execution. This is especially a problem when you have shared hardware (virtualization); although virtualization on commodity x86-64 hardware gets better and better, you still might be affected by the load other virtual machines put on the same host.
  • Regular processes like backup or the TarPM optimizer might interfere with your performance test. Of course these processes have to run, but you should be prepared for them and know about them. If they run without you knowing, the results of the tests are different than you expect, and you need to research the reasons and rerun the tests: a lot of lost time.
  • The network might have reduced performance or availability due to maintenance or other tests.
  • Same for all connected backend-systems (LDAP server, shop backend, …)

Of course severe conditions can be found on production systems as well, like a bulk data transfer slowing down the network connections. But in most production systems such activities are planned much more carefully than on a test system, where it's more likely that such things happen without announcement and with less planning.

There are 2 ways to mitigate such problems:

  1. You need to have a list of the systems and applications which might affect your performance during the performance test. Be careful here, because each of them can create outliers in your test results, which are hard to explain when you are not aware of them.
  2. Open communication. When you plan a series of performance tests, communicate this to all the parties maintaining systems on this list. You can then resolve conflicting dates (two parties doing tests on the same day) before you get incorrect results.

The basic approach here is: you cannot mitigate severe environmental impacts. Your production system needs to deal with them as well, therefore their appearance can be a good preparation. But when you do testing, you should be able to state clearly what is caused by your test scenario and what is caused by the environment.

A nice example of how such tests can be created to improve resilience against unwanted problems in production is the Simian Army of Netflix. They validate their infrastructure, their application and their processes by turning off systems at random (Chaos Monkey) or even all their systems in a datacenter or Amazon AWS availability zone, just to validate that they can still deliver their service. And, as proven by the last Amazon outages, it's useful: Netflix was less affected than other services leveraging the same Amazon infrastructure, because they had already had the fire drill, the adjusted application design and the required processes to cope with such situations.

Infrastructure sizing for projects

Sizing the infrastructure is always important when you start new IT projects which require their own set of systems, processes and information. When you start a CQ5 project from scratch, you will have a discussion about CPUs, RAM, storage, network infrastructure and so on. That's a common pattern and not at all specific to CQ5 projects.

In many cases you do an upfront calculation of what the situation will be in 3 years (that's a common interval for hardware replacement), and then you go shopping for hardware. In cases where you can qualify the needs quite well, this approach is a good way to go, because you have the discussion only once and you can ask for the adequate infrastructure budget when you do the budget discussion for the project. It is a pretty bad situation when you need to apply for a new hardware budget after 18 months, because then you admit that the initial calculation was wrong and you did a bad job at project start. At least that's often the way people (especially the controllers) see it.

But the assumption that you can do all the math upfront is not valid for every project. In fact, in many projects you won't know what the situation will be in 24 or 36 months, whether the project will have been successful or cancelled in the meantime (for whatever reason). Sometimes external factors influence your systems heavily, or the requirements change drastically. For example, the rise of Facebook and Twitter often reduced the requirements for dedicated community areas on a company's website. Or it just turned out that, despite Facebook and Twitter, your users ask for a community site for your products.

So in these cases any initial calculation will be wrong. But you don't know that when you do your initial budget and your initial system sizing. The only thing you know is: "In 6 months I will understand much better the requirements for system performance and how my systems should look." Because after these 6 months you will have learned a lot about the requirements, the project and also about its rise or fall. You don't need to assume and guess anymore; you know, and you calculate. Any sizing you do at this point in time is much more accurate, because it is based on actual numbers. It fits better to the problems, because you already know some bottlenecks and can react to them now. And in the end it is cheaper, because you only have the hardware you need right now; you haven't spent your dollars and euros on hardware you will need in 12 months (or not at all …), because you buy that in 12 months.

So instead of planning ahead for 36 months, you only plan for 3 to 6 months, and after that time you recheck whether all assumptions still hold true and how the current systems are performing. If everything's fine: you have done a great job! If the situation has changed and you need to react to it: that's ok, that's what this model is for.

For me this looks pretty much like the "waterfall versus agile" discussion in the software development area. In software development the benefits of being agile are often obvious, but in many projects the hardware needs to be allocated as in a traditional waterfall process: you do a sizing, you never change it, and based on this sizing the hardware is provided.

Some reasons why this model of constant review of the system sizing is not adopted:

  • When the upfront design and implementation phases are over, the architects and system designers leave the project. The project is in regular production and you will have just minor changes and maintenance. In this phase you often have people who are less skilled in this topic, and there's no budget to allocate an architect for a regular review session.
  • When the need for a change to the system sizing is detected and framed out, there is no budget available, because changes like this have never been planned.
  • Reactions to new or changed requirements have not planned for a changed system sizing as a consequence of these changed requirements either.

Some reasons why you should switch to a model of constant review of the system sizing:

  • Reduce the risk of planning wrong, be it too large or too small.
  • Reduce the risk of requirements changing your basic assumptions for the system sizing fundamentally.

For sure this approach is not something you can introduce for your project on your own; it needs support from management, because it heavily influences budgets and budget discussions. So here are some ideas how you can start to get there:

  • Clearly communicate any risks of the system sizing: unclear and changing requirements, wrong assumptions about external factors influencing the sizing calculation, or just the uncertainty which any change to a yet unknown technology comes with.
  • Offer a number of variants: a single sizing which might be too large and expensive (to accommodate some risks) versus a sizing which can be changed and is much more accurate for the situation. Outline the differences in the price tags.
  • Start the project with a number of proofs of concept and try to explore the requirements which are considered to have the largest impact on infrastructure sizing.

SSO and basic authentication

Whenever you need to debug an SSO login on CQ5 because all of a sudden a basic authentication dialog appears, make sure that you don't have HTTP basic authentication headers set next to your SSO cookie.

Because the Sling Authentication Service has "Basic authentication (preemptive)" set as default. Which means: whenever no other service feels responsible to extract authentication information from the request (maybe because it is configured not to do this on a certain path), this service will try basic authentication. And just by the way: setting this property to "Enabled" will kill the authentication of incoming replication requests. So it's best to eliminate the "Authorization" header at the dispatcher level and to avoid going through the dispatcher when doing replication.


Required skillset to learn CQ5

I just stumbled over the blog post by Robert Sumner where he describes the required skillsets to become a "WEM/WCM architect", a "J2EE integration specialist" and a "CQ5 developer" for CQ5.

He states that before you become an architect, you should have experience in the integration specialist role. Agreed. But then, for the integration specialist role, Robert says:

“J2EE – it’s all about JAVA here….experience with SSO, SEO, integration, databases, web services, and Caching is essential to implementing and delivering a robust WEMs tool.  These folks can learn the CQ5 technology in a day, because Adobe CQ5 is based on standards.”

In my opinion, this is simply wrong. You cannot learn CQ5 in a day just because it is based on standards. It's nice that something is built on standards, but still: you need to learn to work with these standards, learn the APIs and the way they are supposed to be used. You need to learn the dos and don'ts and the best practices. And even if CQ5 is built on standards: there are a lot of proprietary APIs on top of the standards which you should at least be aware of.

So my answer to the question “What’s the required skill set” is: “Know HTTP! Know the impact of latency and the importance and ways of caching. Know some web frameworks and start to think in pages and request/response instead of transactions.” And if you know Java: even better!

Performance tests (4): Executing a test scenario

Now that we have prepared test scenarios, we have to execute them. Up to here we did only preparation work (but I cannot stress enough that you really need to do this step!), but now you need to get your hands dirty. This means:

  • Decide which system you use to execute the test scenario. A developer's laptop is usually not sufficient to simulate a production-like scenario. This information should already be contained in the test scenario description.
  • Implement the test scenario, so it is in a machine-executable format. Usually you record the specified user activities with a tool, modify them and then replay the recorded behavior using that tool.
  • Get test data. Always having the same test data makes the analysis much easier.
  • Inform all parties that you are running performance tests and that the results are sensitive to any other activity on these systems.
  • Run the tests
  • Do the analysis.
  • React accordingly.

Let me point out a few important aspects of this execution. First, the choice of the systems to run the performance tests on. I recommend having them as similar to production as possible: the same hardware platform, the same number of systems, the same SAN, the same content. This makes it much easier when someone asks the important question: "That's all nice, but can we apply the results of this test to our production environment as well?" If you have identical hardware and an identical setup, it does not require many arguments to get to an answer everyone can agree on: "If we ran this scenario on production, we would get the same results." And that's a very important step later on, when you need to decide about follow-up activities.

Secondly, the test data. It is crucial that if you run a test multiple times, you always run it on the same test data. This is not only required for the functional aspects (your most performance-critical function should not fail because of missing test data); the test data might also impact performance. It makes a difference whether you test your application on a 20 gigabyte repository or on a 200 gigabyte repository. I usually recommend creating a copy of the instances before the first execution run of a performance scenario; before the execution of the second run, this copy is restored, so we start from the same point. This is important so you can actually compare the results of multiple test runs.