healthcheck 1.0 now available via public maven repository

Finally we made it. Thanks to Alex, the healthcheck binaries are now available in the public Sonatype repository. If you have already included this repository in your pom.xml, you only need to add this dependency:

<dependency>
  <groupId>de.joerghoh.cq5.healthcheck</groupId>
  <artifactId>core</artifactId>
  <version>1.0.0</version>
  <type>pom</type>
</dependency>
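
In case you have not yet included the repository, a minimal declaration in your pom.xml could look like this; the URL is an assumption on my side (the usual Sonatype OSS releases repository), so check the project README for the exact one:

<repositories>
  <repository>
    <id>sonatype-oss-releases</id>
    <url>https://oss.sonatype.org/content/repositories/releases/</url>
  </repository>
</repositories>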

I also recommend adding the jmx-extension bundle, as it adds some really useful MBeans:

<dependency>
  <groupId>de.joerghoh.cq5.healthcheck</groupId>
  <artifactId>jmx-extension</artifactId>
  <version>1.0.0</version>
  <type>pom</type>
</dependency>

We will continue development over at GitHub; we appreciate your feedback.

Performance tests (3): Test scenarios

The overall performance test is usually grouped into smaller test chunks (I call them scenarios), which replicate certain aspects of the daily operation like:

  • normal operation
  • many people log into the system at the start of business hours
  • your marketing campaign went viral and now huge traffic spikes arrive from Twitter/Facebook/G+/…
  • how does your system handle traffic when some instances are down for maintenance?
  • you are preparing a huge amount of new content on authoring
  • (and many more …)

Collect the scenarios you know you will have, and detail them out. If you have historical data on them, use it. If you don't have such data, start with rough estimations. You will adjust them later anyway based on feedback from your production system.

A performance test scenario consists of at least these items (a small sketch follows the list):

  1. Objectives: what should be tested (the "questions" of part 2); these objectives should be detailed and clear, and usually they are oriented toward the aspects of daily operation you just identified.
  2. A number of test cases, which are executed during the performance test; in most real-life situations multiple different activities are carried out in parallel, and your performance test scenarios should reflect that.
  3. Systems and environments, where the tests are executed and can deliver meaningful results.
  4. Facilities to monitor the systems involved in the test execution; it should be described which information is considered valuable for this test scenario and which should be preserved for analysis.
  5. Instructions on how to interpret the monitoring data gathered throughout the test execution.
  6. Instructions on how the test should be executed.
  7. In the end, there should be a clear answer whether the objectives have been met.

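A minimal sketch of how such a scenario could be captured as a simple data structure; all class and field names here are purely illustrative and not part of any testing tool:

import java.util.Arrays;
import java.util.List;

public class PerformanceTestScenario {

    String objective;               // the "question" this scenario should answer
    List<String> testCases;         // activities executed, often in parallel
    String environment;             // system/environment the test runs against
    List<String> monitoredMetrics;  // data to collect and preserve for analysis
    String interpretationGuide;     // how to read the collected monitoring data
    String executionInstructions;   // how to run the test
    Boolean objectivesMet;          // filled in after execution: a clear yes/no answer

    // Example: the "morning login peak" scenario from the list above
    public static PerformanceTestScenario loginPeak() {
        PerformanceTestScenario s = new PerformanceTestScenario();
        s.objective = "Handle the morning login peak within the agreed response times";
        s.testCases = Arrays.asList("login", "open dashboard", "browse content");
        s.environment = "staging, sized like production";
        s.monitoredMetrics = Arrays.asList("response times", "CPU", "heap usage", "error rate");
        s.interpretationGuide = "95th percentile of login responses must stay below 2 seconds";
        s.executionInstructions = "ramp up to 200 concurrent users over 10 minutes, hold for 1 hour";
        return s;
    }
}
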
All of these aspects should be covered. And now it is clear why a good performance test is nothing that can be done without preparation and within a single day. Such ad-hoc tests usually never deliver proper results, because they focus heavily on the test execution and much less on proper preparation of the test cases and on the interpretation of the results.

Even if you plan many more scenarios than you will ever execute: just thinking of them, planning and defining them, often gives you valuable input for your application design. Just as functional test planning makes you think about the system, about expected behaviour and error handling, the same applies here.

(Applying the TDD mechanisms to performance tests would be an interesting exercise …)

Performance tests (part 2): What can you expect from a performance test?

Performance tests sometimes have a myth around them, depending on the personal experience people have had with them. For some it's just a waste of time and effort, because the last test they conducted did not deliver any value. For others it's the best thing since sliced bread.

With performance tests it's the same as with every test: you need to ask a question and model an appropriate test case, and then you get an answer for it. A performance test cannot tell you things about your system which you have not asked for. In particular, it cannot give you any guarantee that your system is rock-solid and fast (even if you ask this question :-)).
It is not a silver bullet for any problem. But it is an important step towards a point where you can start to trust your system to function properly.

So, what's this "asking"? In performance testing you ask questions by designing and executing a test case. When your test case is "Execute 100 requests to this specific URL with 10 concurrent threads; the response should be received within 500 milliseconds; no errors are allowed", the question could be "Can my system cope with 10 concurrent users and respond to them within 0.5 seconds with no errors?". The execution of the performance test can then answer the question with "yes" or "no", depending on the outcome.
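
As a minimal sketch of what such a test case could look like in plain Java (the URL is a placeholder and the thresholds are just the example values from above; a real test would rather use a dedicated load test tool like JMeter):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SimpleLoadTest {

    public static void main(String[] args) throws Exception {
        final URL url = new URL("http://localhost:4503/content/mysite/en.html"); // example URL
        final int totalRequests = 100;
        final int concurrentThreads = 10;
        final long maxResponseMillis = 500;

        ExecutorService pool = Executors.newFixedThreadPool(concurrentThreads);
        final AtomicInteger failures = new AtomicInteger();

        for (int i = 0; i < totalRequests; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    long start = System.currentTimeMillis();
                    try {
                        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                        int status = conn.getResponseCode();
                        long duration = System.currentTimeMillis() - start;
                        if (status != 200 || duration > maxResponseMillis) {
                            failures.incrementAndGet();
                        }
                    } catch (Exception e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);

        // The answer to our question: "yes" only if every request was fast and error-free
        System.out.println(failures.get() == 0 ? "yes" : "no");
    }
}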

If the outcome is "no", we obviously have to change something and work on our systems and our code to get the answer we are aiming for: "yes". But even then our success is limited. Even in the "yes" case we cannot infer that with 15 concurrent users the system would still work fine, just a bit slower than with 10 concurrent requests. This is an assumption we often make because we have only a limited set of validated results, while technically we should validate them all. So every test case answers exactly one question and gives us only one piece of information about the performance behavior of the system. For all other questions we have to create different performance test cases. (There will be a follow-up article on this topic …)

So, as a conclusion: the quality of the results of your performance tests depends directly on the questions you ask. For this reason you should invest a good amount of time in the preparation of your performance tests to define the right questions.

Why you should do performance tests

With this posting I want to start a series of small articles dealing with performance tests in CQ5 projects. Most of it can be applied to many other web projects as well.

This topic is one of my all-time favorites in a project plan. In my experience with CQ projects over the last years, performance tests are usually planned to be executed 3 weeks before go-live, with a few person days of preparation. No one knows what should be tested and to what extent, and no one knows what the results should be. And performance tests are also one of the first activities removed from the project plan when time is getting short.

But of course most members of the project team (mostly the technical ones) know that dropping the performance tests will cause severe problems later in production. But the "told you so" afterwards isn't really helpful …

Some reasons, why you should do a performance test:

  • During development the focus is on coding and functionality. Functional tests are executed by developers or small test teams. In production the number of users is usually much higher. Just working on the assumption that the system will also perform properly under that load is highly risky.
  • When you have done a performance test, you know some of the conditions under which the system performs well or fails. In both cases you learn about your system, and you can either address deficits early or reduce the risk of your performance assumptions.
  • Operations will have an easier job if you can tell them about the capabilities and limits of the system. Then they can act accordingly in case of problems and don't need to waste effort determining root causes and issues of your shiny new application. (In the end they cannot do much about it anyway, especially on a production system. They need to bring it back online as soon as possible; they won't debug it.)
  • Performance deficits sometimes go hand in hand with stability issues. Consider a request which requires some memory to execute. Having 5 of these requests in parallel is not a problem, but if 100 of them run concurrently (because each one is processed slowly), the stability of your JVM might be impacted.
  • The business aspect: bad performance can slow down your business and have a huge impact on your KPIs, like successful conversions. If you know that certain parts of your site are slow, you can address these deficits early in the project, long before they impact the KPIs.

So from a risk perspective it is very helpful to conduct performance tests. If the discussion is about cutting either a feature or a performance test, I strongly recommend cutting the feature (unless it's one of the features of the minimum viable product). From an overall stress level it is usually much easier to add a feature after go-live than to ship a release which has performance and/or stability problems.

JCR sessions and CQ5

The JCR 2.0 specification says about sessions:

A user connects to a repository by passing a set of credentials and the name of the workspace that the user wishes to access. The repository returns a session which binds the user to the requested persistent workspace with a level of authorization determined by that user’s credentials. A session is always bound to exactly one persistent workspace, though a single persistent workspace may be bound to multiple sessions.

In the implementation of Jackrabbit/CRX this means that the session enforces authorization; if a user is not authorized to see certain nodes, the session will not expose them to the user. It's just as if they were not there. Workspace handling is not really relevant in the case of CQ5, as only the "crx.default" workspace is used.

But what does this mean for your work with CQ5?

  • First of all, sessions are an integral part of the security concept within CQ5. A normal user session is different from an admin session: in a user session normally fewer nodes are visible than in an admin session. You cannot just switch the user context within a session; instead you create a new one.
  • In an admin session you are not restricted at all; you can do everything.
  • In the context of requests to templates and components, you will work with an authenticated session (or an anonymous session in the standard publish case). Sling authenticates the request, provides you with an appropriately authenticated session, and closes this specific session when the request processing is done. Do not close this session yourself!
  • In a request scope you normally should not need to create a new session. You should leverage the session associated with the request.
  • Not closing a single session isn't a problem, but not closing many sessions will cause a serious memory leak.
  • Sessions are cheap. You can create as many of them as you need. So it's perfectly ok to open a session to change a few properties and then close that session again. Try to keep sessions as short-lived as possible.
  • The only exception to this rule are sessions in which you have registered a JCR observation listener. But use these sessions only for reading. If you need to write to the repository in an observation listener, create a new session for these write operations.
  • In most cases Unix daemons work in the context of a specialized user to limit the impact of a security breach through this daemon. You should act alike and use sessions of specialized users when you work with JCR sessions. For 99% of all use cases you should not use an admin session!
  • If you want to run a session with a less privileged user than the admin, you should use the impersonation feature of JCR. Do not hardcode any user password in your code. This code fragment shows how you can use impersonation to create a session with a different user than admin, but without hardcoding the password of that user:
private static final String RUNAS_USER = "technical_user";

@Reference
SlingRepository repo;

Session adminSession = null;
Session userSession = null;
try {
  // open an administrative session only to impersonate the technical user
  adminSession = repo.loginAdministrative(null);
  userSession = adminSession.impersonate(new SimpleCredentials(RUNAS_USER, "".toCharArray()));
  // the admin session is no longer needed once impersonation has succeeded
  adminSession.logout();
  adminSession = null;
  doSomethingWith(userSession);
} catch (RepositoryException e) {
  // report exception
} finally {
  if (adminSession != null) {
    adminSession.logout();
  }
  if (userSession != null) {
    userSession.logout();
  }
}

Update March 19th 2013: My colleague Alex Klimetschek recommended that I add this statement: "Admin sessions sound so useful. But they are dangerous!"
Update April 9th 2013: The SimpleCredentials constructor takes a char array as password, not a String. Thanks Anand.

CQ5 healthcheck: backport for CQ 5.4

I learned that there are quite a number of projects out there which are (still?) bound to CQ 5.4 and cannot move to a newer version right now. For these I created a backport of healthcheck version 1.0, which works reasonably well on my personal CQ 5.4 instance. You can find the code on GitHub in the release-1.0-cq54 branch, but I don't provide a compiled binary version.

The main changes to the master branch:

  • I backported the 1.0 branch, not master. Currently the differences aren't that big, so you can maintain a "master-cq54" branch on your own.
  • Adjusted the pom files; no code changes were required for this.
  • The PropertiesUtil class is not available, but you can replace it 1:1 with the OsgiUtil class that ships with CQ 5.4 (see the small example after this list).
  • Use "sling:OsgiConfig" nodes instead of nt:file nodes with the extension ".config" (the latter are only supported on CQ 5.5 and later).
  • CQ 5.4 does not support sub-folders within the config folder; you need to put all config nodes directly into it.
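
To illustrate that 1:1 replacement, here is a small sketch showing both variants side by side for comparison only; the property name and default value are placeholders:

import org.apache.sling.commons.osgi.OsgiUtil;
import org.apache.sling.commons.osgi.PropertiesUtil;
import org.osgi.service.component.ComponentContext;

protected void activate(ComponentContext context) {
    Object raw = context.getProperties().get("some.property");

    // CQ 5.5 and later (newer Sling commons.osgi bundle):
    String value = PropertiesUtil.toString(raw, "default");

    // CQ 5.4 backport: the identical call, just on the older OsgiUtil class:
    String value54 = OsgiUtil.toString(raw, "default");
}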

And of course the biggest limitation:

  • For replication there is no out-of-the-box JMX support; therefore I dropped the respective config nodes.
  • If you want to contribute support for this feature, you’re welcome 🙂

So have fun with it.

Logging best practices

In between all the hype around the CQ5 healthcheck I want to tell you about something which has more impact on your daily life as a developer. Today it's some bits and pieces about logging.

Logging is one of the unloved children of a web project. It's almost never stated in the project requirements, but every developer adds logging statements. Everybody has some ideas about what should be logged, but it is never explicitly defined. While I base my recommendations on CQ5 projects, they can be applied widely. And for Java there are — of course — lots of great postings out there with tips and tricks on that topic. I just found 10 tips for proper application logging. Sounds pretty good.

But here goes my personal list of best practices (all of them pretty obvious, right?)

  • Use the SLF4J logging framework, as it is the standard logging framework in Sling; there's no need to introduce another one.
  • Do not use the log levels INFO, WARN and ERROR for debugging your code during development. You won't have the time to clean them up later, and then you will pollute the production log with your debugging statements. Your operations people will hate you for that, because the CQ5 standard log level is INFO.
  • Instead, use the log level DEBUG and a special logging configuration for your development environment which logs your debug statements (a small example follows this list).
  • Clean up your debug statements when you are done with development. In a production environment I recommend setting the log level to DEBUG if there are problems with some parts of the application. The last thing I want to see then are statements like "loop1", "Inside for-loop" and "1234" (yes, I have seen them).
  • Ask yourself: does a piece of code which does not have any logging statements really perform any function? Even if it is perfectly written code with no bugs, it is sometimes interesting to see what information is processed. So every piece of code should contain debug statements.
  • Stacktraces are sometimes a valuable asset. Which also means that a full stacktrace is not always required.
  • And most important: do not use "System.out.println()"!
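
A small sketch of what this looks like in practice; the class name and messages are made up for the illustration:

import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class NewsletterImporter {

    private static final Logger log = LoggerFactory.getLogger(NewsletterImporter.class);

    public void importEntries(List<String> entries) {
        // DEBUG: only visible with a dedicated logging configuration, e.g. on development systems
        log.debug("Starting import of {} entries", entries.size());
        for (String entry : entries) {
            try {
                // ... actual processing ...
                log.debug("Imported entry {}", entry);
            } catch (RuntimeException e) {
                // ERROR with a stacktrace only where it really helps the analysis
                log.error("Cannot import entry {}", entry, e);
            }
        }
    }
}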

If you ask yourself "What should I log?", change perspective. Just imagine that someone strongly suspects an issue in your code. What information would you need to investigate this? Add this information in proper DEBUG statements.

So, that’s for today’s lesson. Stay tuned! CQ 5.6 is coming …

CQ5 healthcheck — how to use

The recent announcement of my healthcheck project caused some buzz, mostly related to how it can be used. So I want to show you how you can leverage the framework for yourself.

The statuspage

First, the package already contains a status page (reachable via <host>/content/statuspage.html), which looks like this:

[Screenshot: CQ5 healthcheck status page]

The first relevant piece of information is the "Overall Status": it can be "OK", "WARN" or "CRITICAL".

This information is computed from all the individual checks listed in the details table, according to this ruleset (a small sketch of this aggregation logic follows the list):

  • If at least 1 check returns CRITICAL, the overall status is "CRITICAL".
  • If at least 1 check returns WARN and no check returns CRITICAL, the overall status is "WARN".
  • If all checks return OK, the overall status is "OK".

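A minimal sketch of this aggregation logic; the enum and method names are mine, not the actual classes of the healthcheck project:

import java.util.List;

enum Status { OK, WARN, CRITICAL }

final class OverallStatus {

    // Aggregate individual check results according to the ruleset above.
    static Status compute(List<Status> individualResults) {
        Status overall = Status.OK;
        for (Status s : individualResults) {
            if (s == Status.CRITICAL) {
                return Status.CRITICAL;   // one CRITICAL makes the whole instance CRITICAL
            }
            if (s == Status.WARN) {
                overall = Status.WARN;    // WARN wins over OK, but can still be overruled
            }
        }
        return overall;
    }
}
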
The overall status is easily parseable on the statuspage by a monitoring system.

The individual checks are listed by name, status and an optional message. This list should be used to determine which check failed and caused the overall status to deviate from OK.

The status in detail:

  • OK: obvious, isn’t it?
  • WARN: the checked status is not OK, but also not CRITICAL. The system is still usable, but you need to observe it more closely, or you need to perform some actions so the situation doesn't get worse.
  • CRITICAL: the system should not be used, and user experience will be impacted. Action is required.

Managing the loadbalancers

Any loadbalancer in front of CQ5 instances should also be aware of the status of the instances. But loadbalancers probe much more often (about every 30 seconds), and they don't have many capabilities to parse complex data. For this use case there is the "/bin/loadbalancer" servlet, which returns only "OK" with status code 200, or "WARN" with status code 500. Here WARN covers both WARN and CRITICAL; in both cases it is assumed that the loadbalancer should not send requests to that instance.
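
A minimal sketch of such a probe in plain Java; a real loadbalancer would of course use its own built-in HTTP health check, and host and port are example values:

import java.net.HttpURLConnection;
import java.net.URL;

public class LoadbalancerProbe {

    // Returns true if the instance should stay in the loadbalancer rotation.
    public static boolean instanceIsHealthy(String baseUrl) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(baseUrl + "/bin/loadbalancer").openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getResponseCode() == 200;   // 200 = "OK", 500 = "WARN"/"CRITICAL"
        } catch (Exception e) {
            // unreachable instances must be taken out of rotation as well
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(instanceIsHealthy("http://localhost:4503"));
    }
}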

That's it for now. If you have feedback, mail me or just create an issue on GitHub.

CQ5 healthcheck version 1.0

My colleague Alex already disclosed it in early December, but at that time it was not ready yet. In the meantime I think it's actually time to release it.

So here it is: the CQ5 healthcheck. A small and easy-to-understand framework to monitor the status of your CQ5 instance. Its main features are:

  • Usable out of the box.
  • All MBeans can be monitored just by configuration.
  • Extendable by simple OSGi services.
  • Features an extended status page as well as a machine interface for automatic monitoring.

And the best part: the source is freely available on GitHub, a package ready for installation is available on Package Share, and the installation is very easy:

  1. Download and install the package from Package Share.
  2. Go to http://<your-instance>:4502/content/statuspage.html
  3. Enjoy!

So feel free to download it, install it, fork it, extend it. The code is licensed under the Apache License, so you don't have to disclose your extensions and modifications at all. But I'd love to get contributions back 🙂

Currently the most useful information is stored in the README file, but I hope that I can move it over to the project wiki. This is just the announcement; I plan to add some posts to this blog about how you can write your own health checks (which isn't hard, by the way).

Enjoy your new toy. I'd love to get feedback from you, either here on the blog, via Twitter (@joerghoh) or in geek style via pull requests.

And many thanks to Alex Saar and Markus Haack for their support and contributions.

Why is it hard to do disk size estimations upfront?

Whenever a new project starts, a project manager is responsible for the project sizing, so that the right number of people with the right skills are assigned to the project. In many projects another early task is to size the hardware. This has mostly to do with the time it takes to buy and deploy new hardware, which can be pretty long. On one project I did, it took IT 10 months (!!) from the decision to buy 8 of "these" boxes until we were able to log in on them. And by the way, this was the regular hardware purchasing process with no special cases …

Anyway, even if the whole "new hardware purchase and deployment" process takes only 6 weeks, you cannot just start and determine only then what hardware is needed. When development starts, a development infrastructure must be provided, consisting of a reasonable number of systems with enough available resources. So one of the earliest project tasks is an initial system sizing.

If you have done it a few times for a specific type of project (for example CQ5 projects), you can give a basic sizing without doing major calculations; at that point you usually don't have enough information to do a real calculation anyway. So for a centralized development system (that's where the continuous integration server deploys to) my usual recommendation is "4 CPUs, 8-12G RAM, 100G free disk; this for 1 author and 1 publish". This is a reasonable system to actually run development on. (Remember that each developer also has an authoring and a publishing instance deployed on her laptop, where she actually tries out her code. On the central development systems all development tests are executed, as well as some integration tests.)

This gets much harder if we talk about higher environments like staging/pre-production/integration/test (or whatever you might call them) and — of course — production, because there we have much more content. This content is the most variable factor in all calculations, because in most requirement documents it is not clear how much content will be in the system in the end, how many assets will be uploaded, how often they are changed, and when they will expire and can be removed. To be honest, I would not trust any number given in such a document, because it usually changes over time, even during the very first phase of the project. So you need to be flexible regarding content and also regarding the disk space calculation.

My colleague Jayan Kandathil posted a calculation for the storage consumption of images in the datastore. It’s an interesting formula, which might be true (I haven’t validated the values), but I usually do not rely on such formulas because:

  • We do not only upload images to the DAM, and besides the datastore we also have the TarPM and the Lucene index, which contribute to the overall repository growth.
  • I don't know if there will be adjustments to the "Asset update" workflow, especially whether more, fewer or changed renditions will be created. With CQ 5.5 any change to asset metadata also affects the datastore (XMP writeback changes the asset binary, which results in a new file in the datastore!).
  • I don’t know if I can rely on the numbers given in the requirements document.
  • There is a lot of other content in the repository, which I usually cannot estimate upfront. So the datastore consumption of the images is only a small fraction of the overall disk space consumption of the repository.

So instead of calculating the disk size based only on assumptions, I usually recommend a disk size to start with. This number is high enough that they won't fill it up within the first 2-3 months, but it is also not so large that they will never ever reach 100% of it. It's somewhere in between. So they can go live with it, but they need to monitor it. Whenever they reach 85% of the disk size, IT has to add more disk space. If you run this process for some time, you can make a pretty good forecast of the repository growth and react accordingly by attaching more disk space. I cannot make this forecast upfront, because I don't have any reliable numbers.

So, my learning from this: I don't spend much time on disk calculations upfront. I only give the customer a model, and based on this model they can react and attach storage in a timely manner. This is also the cheapest approach, because you attach storage only when it's really needed and not based on some unreliable calculation.