Performance tests modelling (part 2)

This is the second blog post in the series about performance test modelling. You can find the overview of this series and links to all its articles in the post “Performance tests modelling (part 1)“.

In this blog post I want to cover the aspect of “concurrent users”: what it means in the context of a performance test and why it’s important to clearly understand its impact.

Concurrent users is an often-used measure to indicate the load put on a system, expressed as the number of users who are using that system at the same time. For that reason many performance tests state as a quantitative requirement: “The system should be able to handle 200 concurrent users”. While that seems to be a good definition at first sight, it leaves many questions open:

  • What does “concurrent” mean?
  • And what does “user” mean?
  • Are “200 concurrent users” enough?
  • Do we always have “200 concurrent users”?

Definition of concurrent

Let’s start with the first question: What does “concurrent” really mean on a technical level? How can we measure that our test indeed simulates “200 concurrent users” and not just 20 or 1000?

  • Are there any server-side sessions which we can count and which directly give us this number? And do we set up our test in a way to hit that number?
  • Or do we have to rely on a vaguer definition like “users are considered concurrent when their page loads are less than 5 minutes apart”? And do we design our test accordingly?

Actually, it does not matter much which definition you choose. It is just important that you explicitly state which definition you use, and which metric you use to verify that you actually hit that number. This is an important decision when it comes to implementing your test.

And as a side note: many commercial tools have their own definition of “concurrent”. Here, too, the exact definition does not matter, as long as you are able to articulate it.
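
If you go with the second definition from the list above, you can derive the number directly from an access log. Below is a minimal sketch in Python of that calculation; the log format, the way a “user” is identified and the 5-minute window are assumptions of this example, not a prescription.

```python
from datetime import datetime, timedelta

# Assumption: the access log has already been parsed into (user_id, timestamp)
# tuples; how a "user" is identified (session cookie, IP, login) is part of
# your own definition and must be stated explicitly in the test model.
WINDOW = timedelta(minutes=5)

def concurrent_users(requests, at_time):
    """Count users with at least one page load within WINDOW before at_time."""
    last_seen = {}
    for user_id, ts in requests:
        if ts <= at_time and (user_id not in last_seen or ts > last_seen[user_id]):
            last_seen[user_id] = ts
    return sum(1 for ts in last_seen.values() if at_time - ts <= WINDOW)

# Example: three users, but only two of them were active within the last 5 minutes.
now = datetime(2024, 1, 1, 12, 0)
log = [("alice", now - timedelta(minutes=1)),
       ("bob",   now - timedelta(minutes=4)),
       ("carol", now - timedelta(minutes=20))]
print(concurrent_users(log, now))  # -> 2
```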

What is a user?

The next question is about “the user” which is modeled in the test. To simplify the test and its execution, one or more “typical” user personas are created, which visit the site and perform some actions. That is definitely helpful, but it is just that: a simplification, because otherwise our model would explode due to the sheer complexity and variety of user behavior. And sometimes we don’t even know what a typical “user” does on our site, because that system will be brand-new.
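
To make such a persona tangible, here is a minimal sketch of how a simplified “typical user” could be expressed as a set of weighted actions; the persona name, the actions and the weights are purely hypothetical.

```python
import random

# Hypothetical persona: a "casual visitor" who mostly browses and rarely buys.
# The action names and weights are assumptions that belong in the test model.
CASUAL_VISITOR = [
    ("open_homepage",      50),  # relative weight
    ("search_for_product", 30),
    ("view_product_page",  15),
    ("add_to_cart",         4),
    ("checkout",            1),
]

def next_action(persona):
    """Pick the next action of this persona according to its weights."""
    actions, weights = zip(*persona)
    return random.choices(actions, weights=weights, k=1)[0]

# One simulated visit consisting of roughly ten actions.
visit = [next_action(CASUAL_VISITOR) for _ in range(10)]
print(visit)
```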

So this is a case where we have a huge variance in the behavior of the users, which we should call out in our model as a risk: the model is only valid if the majority of the users behave more or less as we assumed.

But is this all? Do all users really perform at least 10% of the actions we assume they do?

Let’s brainstorm a bit and try to find answers to these questions:

  • Does the Google bot behave like that? What about all the other search engine bots?
  • What about malware scanners which try to hit a huge list of WordPress/Drupal/… URLs on your site?
  • What about other systems performing (random?) requests against your site?

You could argue that this traffic has little or no business value, and that for this reason we don’t test for it. It could also be assumed that this is just a small fraction of the overall traffic and can be ignored. But that is just an assumption, and nothing more. You just assume that it is irrelevant. But often these requests are not irrelevant, not at all.

I have encountered cases where it was not the “normal users” who brought down a system, but rather this non-normal type of “user”. One example: the custom 404 handler was very slow, so the basic, undocumented assumption “We don’t need to care about 404s, as they are very fast” was violated, and that brought down the site. All performance tests passed, but the production system failed nevertheless.

So you need to think about “user” in a very broad sense. And even if you don’t implement the constant background noise of the internet in your performance test, you should list it as a factor. If you know that a lot of this background noise will trigger an HTTP status code 404, you are more likely to check that this 404 handler is fast.
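
Checking that assumption does not require a full load test; a quick probe of a few typical “noise” URLs is often enough. The sketch below does exactly that; the base URL, the sample paths and the threshold are hypothetical and need to be adapted to your site.

```python
import time
import urllib.error
import urllib.request

# Hypothetical values: adjust the base URL, sample paths and threshold to your site.
BASE_URL = "https://www.example.com"
NOISE_PATHS = ["/wp-login.php", "/xmlrpc.php", "/this-page-does-not-exist"]
MAX_404_MILLIS = 100

for path in NOISE_PATHS:
    start = time.monotonic()
    try:
        status = urllib.request.urlopen(BASE_URL + path).status
    except urllib.error.HTTPError as e:
        status = e.code  # expected: 404 from the custom handler
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{path}: HTTP {status} in {elapsed_ms:.0f} ms")
    if status == 404 and elapsed_ms > MAX_404_MILLIS:
        print(f"  WARNING: 404 handler slower than {MAX_404_MILLIS} ms")
```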

Are “200 concurrent users” enough?

One piece of information every performance test has is the number of concurrent users which the system must be able to handle. But even if we assume that “concurrent” and “users” are both well defined, is this enough?

First, what data is this number based on? Is it based on data derived from another system which the new system should replace? That’s probably the best data you can get. Or, when you build a new system, is it based on good marketing data (which would be okay-ish), on assumptions about the expected usage, or just on numbers we would like to see (because we assume that a huge number of concurrent users means a large audience and a high business value)?

This is probably the topic which will be discussed the most. But the number, and the way that number is determined, should be challenged and vetted, because it is one of the cornerstones of the whole performance test model. It does not make sense to build a highly performant and scalable system only to find out afterwards that the business numbers were grossly overrated, and that a smaller and cheaper solution would have delivered the same results.

What about time?

A more important aspect, which is often overlooked, is timing: how many users are working on the site at any given moment? Do you need to expect the maximum number for 8 hours every day, or just during the peak days of the year? Do you have a more or less constant usage, or usage only during business hours in Europe?

This heavily depends on the type of your application and the distribution of your audience. If you build an intranet site for a company located only in Europe, the usage during the night is pretty much “zero”; it will start to increase at 06:00 in the morning (probably the Germans going to work early :-)), hit its maximum between 09:00 and 16:00, and go back to zero by 22:00 at the latest. The contrast to this is a site visited world-wide by customers, where we can expect a higher and almost flat line, of course with variations depending on the number of people being awake.
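
Such a daily curve can be written down explicitly in the test model, for example as relative load levels per hour which are multiplied with the peak number from the requirement. The numbers below are a purely illustrative sketch for the intranet example.

```python
# Hypothetical load profile for the European intranet example: relative load
# (0.0 - 1.0) per hour of the day, multiplied with the peak number of
# concurrent users from the requirement.
HOURLY_PROFILE = {
     0: 0.0,  6: 0.1,  7: 0.3,  8: 0.7,
     9: 1.0, 12: 0.8, 14: 1.0, 16: 0.9,
    18: 0.4, 20: 0.1, 22: 0.0,
}

PEAK_CONCURRENT_USERS = 200  # the number from the requirement

def users_at(hour):
    """Expected concurrent users at a given hour (uses the nearest defined hour below)."""
    defined = max(h for h in HOURLY_PROFILE if h <= hour)
    return int(HOURLY_PROFILE[defined] * PEAK_CONCURRENT_USERS)

print(users_at(3), users_at(10), users_at(21))  # -> 0 200 20
```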

This influences your tests as well, because in both cases you don’t need to simulate spikes, meaning something like a 500% increase in users within 5 minutes. On the other hand, if you plan large marketing campaigns addressing millions of users, this might be exactly the situation you need to plan and test for. Not to mention if you book an ad slot during the Super Bowl break.

Why is this important? Because you only need to test scenarios which you expect to see in production, and you can ignore scenarios which don’t have any value for you. For example, it is a waste of time and investment to test for a sudden spike in the above-mentioned intranet case of the European company, while for marketing campaigns it is essential to test a scenario where such a spike comes on top of the normal traffic, as sketched below.
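
Such a spike scenario can again be described as a small set of load stages layered on top of the baseline; the durations and multipliers here are assumptions for illustration only.

```python
# Hypothetical spike scenario: baseline traffic plus a campaign-driven spike,
# i.e. a 500% increase within 5 minutes, held for a while, then back to baseline.
BASELINE_USERS = 200

STAGES = [
    # (duration in minutes, target concurrent users)
    (30, BASELINE_USERS),      # warm-up / steady baseline
    ( 5, BASELINE_USERS * 5),  # the spike: 500% within 5 minutes
    (20, BASELINE_USERS * 5),  # campaign traffic stays high
    (10, BASELINE_USERS),      # ramp back down to baseline
]

total_minutes = sum(duration for duration, _ in STAGES)
peak_users = max(users for _, users in STAGES)
print(f"Scenario length: {total_minutes} minutes, peak load: {peak_users} users")
```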

Summary

“N concurrent users” by itself is not much information; while it can serve as input, your performance test model should contain a more detailed understanding of that definition and of what it means for the performance test. Otherwise you will focus only on a given number of users of this idealized type and ignore every other scenario and case.

In the next blog post I will cover how the system and the test data itself will influence the result of the performance test.
