Performance test modelling (part 5)

This is part 5 and the final post of the blog post series about performance test modelling; see part 1 for an overview and the links to all articles of this series.

In the previous post I discussed how the system under test and the modelling of the test content influence the result of a performance test, and how you implement the most basic performance test scenario.

In this blog post I want to discuss the predicted result of a performance test versus its actual outcome, and what you can do when these do not match (they rarely do on the first execution). I also want to discuss the situation where, after go-live, you find that a performance test delivered the expected results but did not match the observed behavior in production.

Scenario 1: The performance test does not match the expected results

In my experience every performance test, no matter how good or bad the basic definition is, contains at least 2 relevant data points:

  1. the number of concurrent users (we discussed that already in part 1)
  2. and an expected result, for example that the transaction must be completed within N seconds.

What if you don’t meet the performance criteria in point 2? This is typically the moment when customers on AEM as a Cloud Service start to raise questions to Adobe about the number of pods, hardware details etc., as if the problem can only be the hardware sizing on the backend. If you don’t have a clear understanding of all the implications and details of your performance tests, this often seems to be the most natural thing to ask.

But if you have built a good model for your performance test, your first task should be to compare the assumptions with the results. Do you have the expected cache-hit ratio on the CDN? Were some assumptions in the model overly optimistic or pessimistic? As you now have actual data to validate your assumptions, you should do exactly that: go through your list of assumptions and check each one of them. Refine them. And when you have done that, modify the test and start another execution.
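Checking the cache-hit-ratio assumption can be as simple as counting cache statuses in the CDN logs. Here is a minimal sketch; the status values (HIT/MISS/PASS) and the sample data are assumptions, so adapt them to whatever your CDN actually logs:

```python
from collections import Counter

def cache_hit_ratio(statuses):
    """Overall cache-hit ratio from a list of per-request cache
    statuses, e.g. extracted from CDN log lines."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return counts["HIT"] / total if total else 0.0

# hypothetical sample: 90 hits, 7 misses, 3 passes
sample = ["HIT"] * 90 + ["MISS"] * 7 + ["PASS"] * 3
print(f"cache-hit ratio: {cache_hit_ratio(sample):.0%}")  # 90%
```

Comparing this number per content type (pages, images, clientlibs) against the modelled expectation usually pinpoints which assumption was off.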

And at some point you might come to the conclusion that all assumptions are correct and you have the expected cache-hit ratio, but the latency of the cache misses is too high (in which case the required action is performance tuning of individual requests). Or that you have already reduced the cache MISSES (and cache PASSES) to the minimum possible and the backend is still not able to handle the load (in which case the expected outcome is an upscale). It can also be both.

That’s fine, and then it’s the perfect time to talk to Adobe and share your test model, execution plan and the results. I wrote in part 1:

As you can imagine, if I am given just a few diagrams with test results and test statistics as preparation for this call with the customer … this is not enough, and very often more documentation about the test is not available. Which often leads to a lot of discussions about some very basic things and that adds even more delay to an already late project and/or bad customer experience.

But in this situation, when you have a good test model and have done your homework already, it’s possible to have a meaningful discussion directly, without the need to uncover all the hidden assumptions. Also, if you have that model at hand, I assume that performance tests were not an afterthought, and that there are still reasonable options to make some changes which will either completely fix the situation or at least remediate the worst symptoms, without impacting the go-live and the go-live date too much.

So while this is definitely not the outcome we all work, design, build and ultimately hope for, it’s still much better than the 2nd option below.

I hope that I don’t need to talk about unrealistic expectations in your performance tests, for example demanding a p99.9 of 200 ms latency while at the same time requiring that a good number of requests are always handled by the AEM backend. You should have detected such unrealistic assumptions much earlier, mostly during design and then in the first runs during the evolution phase of your test.

Scenario 2: After go-live the performance is not what it’s supposed to be

In this scenario a performance test was either not done at all (don’t blame me for it!) or the test passed, but its results did not match the observed reality. This often shows up as outages in production or unbearable performance for users. This is the worst-case scenario, because everyone assumed the contrary, as the performance test results were green. Neither the business nor the developer team is prepared for it, and there is no time for any mitigation. This normally leads to an escalated situation with conference calls, involvement from Adobe, and in general a lot of stress for all parties.

The entire focus is on mitigation, and we (I am speaking now as a member of the Adobe team, which is often involved in such situations) will try everything to mitigate the situation by implementing workarounds. As in many cases the most visible bottleneck is on the backend side, upscaling the backend is indeed the first task. And often this helps to buy you some time to perform other changes. But there are even cases where an upscale of 1000% would be required to somehow mitigate the situation (which is possible, but also very short-lived, as every traffic spike on top will require an additional 500% …); also, it’s impossible to speed up a single-threaded request taking 20 seconds by adding more CPU. These cases are not easy to solve; the workaround often takes quite some time and is often very tailored, and there are cases where a workaround is not even possible. Either way, it is normally not a pleasant experience for any of the involved parties.

I deliberately refer to all of these actions as “workarounds”. Because they are not the solution to the challenge of performance problems. They cannot be a solution, because this situation proves that the performance test was testing some scenarios, but not the scenario which shows up in the production environment. It also raises valid concerns about the reliability of other aspects of the performance tests, and especially about the underlying assumptions. Anyway, we all try to do our best to get the system back on track.

As soon as the workarounds are in place and the situation is somehow mitigated, 2 types of questions will come up:

  1. What does a long-term solution look like?
  2. Why did that happen? What was wrong with the performance test and the test results?

While the response to (1) is very specific (and definitely out of scope of this blog post), the response to (2) is interesting. If you have a well-documented performance test model, you can compare its assumptions with the situation in which the production performance problem happened. You have the chance to spot the incorrect or missing assumption, adjust your model and then the performance test itself. And with that you should be able to reproduce your production issue in a performance test!

And if you have a failing performance test, it’s much easier to fix the system and your application, and to apply specific changes which make this failing test pass. It also gives you much more confidence that you changed the right things so that the production environment will handle the same situation in a much better way. Interestingly, this also answers question (1) to a large extent.

If you don’t have such a model in this situation, you are badly off. Because then you either start building the performance test model and the performance test from scratch (which takes quite some time), or you switch to the “let’s test our improvements in production” mode. Most often the production-testing approach is used (along with some basic testing on stage to avoid making the situation worse), but even that takes time and a high number of production deployments. While you can call it agile, others might call it chaos and hoping for the best … the actual opposite of good engineering practice.

Summary

In summary, when you have a performance test model, you are likely to have fewer problems when your system goes live. Mostly because you have invested time and thought in that topic. And because you acted on it. It will not prevent you from making mistakes, forgetting relevant aspects and such, but if that happens you have a good basis to quickly understand the problems and a good foundation to solve them.

I hope that you learned in these posts some aspects about performance tests which will help you to improve your test approach and test design, so you ultimately have fewer unexpected problems with performance. And if you have fewer problems, my life in the AEM CS engineering team is much easier 🙂

Thanks for staying with me throughout this first planned series of blog posts. It was a bit experimental, but the structure this topic required led to some interesting additions to the overall outline (the first outline covered just 3 posts, now we are at 5). And I think even that is not enough: some aspects deserve a blog post of their own.

Performance test modelling (part 4)

This is the 4th post of the blog post series about performance test modelling; see part 1 for an overview and the links to all articles of this series.

In parts 2 and 3 I outlined relevant aspects when it comes to modelling your performance tests:

  • The modelling of the expected load, often expressed as “concurrent users”.
  • The realistic modelling of the system where we want to conduct the performance tests, mostly regarding the relevant content and data.

In this blog post I want to show how you deduce from that data which specific scenarios your performance tests should cover. Because there is no single test which tells you whether the resulting end-user performance is good or not.

The basic performance test scenario

Let’s start with a very simple model, where we assume that the traffic rate is nearly identical for the whole day, and therefore the performance test resembles that model:

At first sight this is quite simple to model, because your performance test will execute requests at a constant rate for the whole period of time.

But as I outlined in part 3, even if it seems that simple, you have to include at least some background noise. You also have to take into account that the cache-hit ratio is poor at the beginning, so you have to implement a cache-warmup phase (normally implemented as a ramp-up phase, in which the load increases up to the planned plateau) and only start to measure after it.

So our revised plan rather looks like this:

Such a test execution (with the proper modelling of users, requests and requested data) can give you pretty good results if your model assumes a pretty constant load.

What if your model requires a much more fluctuating request rate (for example if your users/visitors are primarily located in North America, and during the night you have almost no traffic, but it starts to increase heavily in the American morning hours)? In that case you probably model the warmup in a way that resembles the morning increase in traffic, both in shape and rate. That shouldn’t be too hard, but it requires a bit more explicit modelling than a simple ramp-up.
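Such a profile can be made explicit with a small load-shape function that maps the time of day to a target request rate. The rates and the ramp window below are invented placeholders, not recommendations:

```python
def target_rps(minute, night_rps=5, peak_rps=200,
               ramp_start=6 * 60, ramp_end=10 * 60):
    """Target requests/second for a given minute of the simulated day:
    almost no traffic at night, a linear morning ramp between 06:00
    and 10:00, then a plateau. All numbers are illustrative."""
    if minute < ramp_start:
        return night_rps
    if minute >= ramp_end:
        return peak_rps
    progress = (minute - ramp_start) / (ramp_end - ramp_start)
    return night_rps + progress * (peak_rps - night_rps)

print(target_rps(3 * 60))   # night: 5
print(target_rps(8 * 60))   # mid-ramp: 102.5
print(target_rps(12 * 60))  # plateau: 200
```

Most load-test tools let you feed such a function (or a table derived from it) into their scheduler, so the test shape stays traceable to the model.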

To give you some practical hints towards some basic parameters:

  • Such a performance test should run at least 2-3 hours; and even if you see that the results are not what you expect, not terminating it can reveal interesting results.
  • The warmup phase should cover at least 30 minutes; not only to give the caches time to warm up, but also to give the backend systems time to scale to their “production sizing”; when you don’t execute performance tests all the time, the system might scale down, because there is no sense in having many systems idling when there is no load.
  • It can make sense not to start with 100% of the targeted load, but with smaller numbers, and to increase from there. Because only then can you see the bottleneck which your test hits first. If you start with 100% right away, you might just see a lot of blocked requests, but you don’t know which bottleneck is the most limiting one.
  • When you are implementing a performance test in the context of AEM as a Cloud Service, I recommend also using my checklist for Performance testing on AEM CS, which gives some more practical hints on how to get your tests right; although a few aspects covered there are covered in more depth in this post series as well.
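Put together, these hints could translate into a test plan like the following sketch; the phase durations and the 500 rps target are made-up values standing in for whatever your model prescribes:

```python
TARGET_RPS = 500  # assumed "100%" load from the model

# 30-minute warmup, then stepped plateaus so the first bottleneck
# becomes visible, followed by a long measurement phase at full load.
phases = [
    ("warmup (ramp to 60%)",  30, 0.6),
    ("plateau at 60%",        30, 0.6),
    ("plateau at 80%",        30, 0.8),
    ("measurement at 100%",  120, 1.0),
]

for name, minutes, load in phases:
    print(f"{name:<22} {minutes:>4} min @ {load * TARGET_RPS:.0f} rps")

total_minutes = sum(minutes for _, minutes, _ in phases)
print(f"total runtime: {total_minutes / 60:.1f} h")  # 3.5 h
```

Encoding the plan as data like this also makes it easy to document next to the model, which pays off in the review discussions described in part 5.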

When you have such a test passing, the biggest part of the work is done; and based on your model you can execute a number of different tests to answer more questions.

Variations of the basic performance test

The above model just covers a totally average day. But of course it’s possible to vary the created scenario to answer some more questions:

  • What happens if the load of the day is not 100%, but for some reasons 120%, with identical assumptions about user behavior and traffic distribution? That’s quite simple, because you just increase a number in the performance test.
  • The basic performance test runs just for a few hours and then stops. It gives you the confidence that the system can operate for at least that many hours, but a few issues might go unnoticed. For example, memory leaks accumulating over time might only become visible after many hours of load. For that reason it makes sense to run your test for 24-48 hours continuously to validate that there is no degradation over that time.
  • What’s the behavior when the system goes into overload? An interesting question (but only if the system does not already break when hitting the anticipated load), which is normally answered by a break test: you increase the load more and more, until the situation really gets out of hand. If you have enough time, that’s indeed something you can try, but let’s hope it’s not very relevant 🙂
  • How does the system behave when your backend systems are not available? What if they come online again?

And probably many more interesting scenarios you can think of. But you should only perform these when you have the basic test right.

When you have your performance tests passing, the question is still: How does it compare to production load? Are we actually testing the right things?

In part 5 (the last post of this series) I cover the options you have when the performance test does not match the expected results, and also the worst-case scenario: what happens if you find out after go-live that your performance tests were good, but the production environment behaves very differently?

Performance tests modelling (part 3)

This is post 3 in my series about Performance Test Modelling. See the first post for an overview of this topic.

In the previous 2 posts I discussed the importance of having a clearly defined model of the performance tests, and that a good definition of the load factors (typically measured in “concurrent users”) is required to build a realistic test.

In this post I cover the influence of the test system and test data on the performance test and its result, and why you should spend effort to create a test with a realistic set of data/content. In this post we will do a few thought experiments, and to judge the results of each experiment, we will use the cache-hit ratio of a CDN as a proxy metric.

Let’s design a performance test for a very simple site: it just consists of 1 page, 5 images, 1 CSS and 1 JS file; 8 files in total. Plus there is a CDN in front of it. Let’s assume that we have to test with 100, 500 and 1000 concurrent users. What test result do you expect?

Well, easy. You will get the same test result for all tests irrespective of the level of concurrency, mostly because after the first requests all files will be delivered from the CDN. That means that no matter with what concurrency we test, the files are delivered from the CDN, which we assume will always deliver very fast. We do not test our system, but rather the CDN, because the cache-hit ratio is quite close to 100%.

So what’s the reason to do this test at all, knowing that it just validates the performance promises of the CDN vendor? There is no reason for it. The only reason why we would ever execute such a test is that during test design we did not pay attention to the data which we use to test, and someone decided that these 8 files are enough to satisfy the constraints of the performance test. But the results do not tell us anything about the performance of the site, which in production will consist of tens of thousands of distinct files.
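A quick back-of-the-envelope calculation shows why concurrency does not matter here; the request count is an arbitrary example:

```python
# 100 users issuing 1000 requests each against only 8 distinct files:
# each file causes exactly one MISS, everything else is a HIT
# (assuming nothing expires during the test window).
requests = 100 * 1000
distinct_files = 8
hit_ratio = (requests - distinct_files) / requests
print(f"cache-hit ratio: {hit_ratio:.3%}")  # 99.992%
```

Doubling or tenfolding the user count only changes the denominator, so the ratio stays pinned near 100% regardless of load.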

So let us do a second thought experiment. This time we test with 100’000 files, 100 concurrent users requesting these files randomly, and a CDN which is configured to cache files for 8 hours (TTL=8h). With regard to the cache-hit ratio, what is the expectation?

We expect that the cache-hit ratio starts low for quite some time; this is the cache-warming phase. Then it starts to increase, but it will never hit 100%, as after some time cache entries will expire and start producing cache misses. This is a much better model of reality, but it still has a major flaw: in reality, requests are not randomly distributed; normally there are hotspots.

A hotspot consists of files which are requested much more often than average. Normally these are homepages or other landing pages, plus other pages which users are normally directed to. This set of files is usually quite small compared to the total number of files (in the range of 1-2%), but it makes up 40-60% of the overall requests; you can easily assume a Pareto distribution (the famous 80/20 rule), where 20% of the files are responsible for 80% of the requests. That means we have a hotspot and a long-tail distribution of the requests.

If we modify the same performance test to take that distribution into account, we end up with a higher cache-hit ratio, because now the hotspot can be delivered mostly from the CDN. On the long tail we will have more cache misses, because these files are requested so rarely that they can expire on the CDN without being requested again. But in total the cache-hit ratio will be better than with the random distribution, especially on the often-requested pages (which are normally the ones we care about most).
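Both thought experiments are cheap to check with a tiny simulation. The sketch below models the CDN as a TTL cache (one request = one clock tick, so the TTL of 50,000 ticks loosely stands in for the 8 hours) and compares a uniform request distribution with an 80/20 hotspot; all numbers are illustrative, not measurements:

```python
import random

def simulate_hit_ratio(num_files, num_requests, ttl, sampler):
    """Hit ratio of a TTL cache: a request is a HIT while the entry
    created by the last MISS is still within its TTL (in requests)."""
    expires_at = {}
    hits = 0
    for tick in range(num_requests):
        f = sampler(num_files)
        if expires_at.get(f, -1) > tick:
            hits += 1
        else:
            expires_at[f] = tick + ttl  # miss: (re)fill the cache entry
    return hits / num_requests

def uniform(n):
    return random.randrange(n)

def hotspot(n):
    # 80% of the requests go to 20% of the files (Pareto assumption)
    if random.random() < 0.8:
        return random.randrange(n // 5)
    return random.randrange(n)

random.seed(1)
u = simulate_hit_ratio(100_000, 200_000, ttl=50_000, sampler=uniform)
h = simulate_hit_ratio(100_000, 200_000, ttl=50_000, sampler=hotspot)
print(f"uniform: {u:.1%}, hotspot: {h:.1%}")  # hotspot ratio is clearly higher
```

The exact percentages depend on the chosen parameters, but the ordering does not: the hotspot distribution always beats the uniform one, which is exactly the effect described above.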

Let’s translate this into a graph which displays the response time.

This test is now quite realistic, and if we only focus on the 95th percentile (p95; that means that out of 100 requests, 95 of them are faster than this value) the result would meet the criteria; but beyond that the response time gets higher, because there are a lot of cache misses.
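The p95 view can be misleading exactly because of this long tail. A nearest-rank percentile over a made-up latency sample shows how a healthy p95 can coexist with painful cache misses:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value below which at least
    `pct` percent of the sorted samples fall."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

# 95 fast cache hits (30 ms) and 5 slow cache misses (900 ms), invented
latencies = [30] * 95 + [900] * 5
print(percentile(latencies, 95))  # 30  -> looks fine
print(percentile(latencies, 99))  # 900 -> the slow tail is invisible in p95
```

This is why it is worth reporting p99 (or p99.9) alongside p95 whenever cache misses dominate the tail.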

This level of realism in the test results comes at a price: the performance test model as well as the test preparation and execution are much more complicated now.

And until now we have only considered users; but what happens when we add random internet noise and the search engines (the unmodelled users from the first part of this series) into the scenario? These will add more (relative) weight to the long tail, because such requests do not necessarily follow the usual hotspots; we have to assume a more random distribution for them.

That means that the cache-hit ratio will be lower again, as there will be many more cache misses now; and of course this will also increase the p95 response time. And it will complicate the model even further.

So let’s stop here. As I have outlined above, the most simple model is totally unrealistic, but making it more realistic also makes the model more complex. And at some point the model is no longer helpful, because we cannot transform it into a test setup without excessive effort (creating test data/content, complex rules to implement the random and hotspot-based requests, etc.). That means that especially in the case of test data and test scenarios we need to find the right balance between the investment we want to make into tests and how closely they should mirror reality.

I also tried to show you how far you can get without doing any kind of performance test. Just based on some assumptions, we were able to build a basic understanding of how the system will behave, and how some changes of the parameters will affect the result. I use this technique a lot, and it helps me to quickly refine models and define the next steps or the next test iteration.

In part 4 I discuss various scenarios which you should consider in your performance test model, including some practical recommendations on how to include them.

Performance tests modelling (part 1)

In my last blog post about performance testing I outlined best practices for building and executing a performance test with AEM as a Cloud Service. But I intentionally left out a huge aspect of the topic:

  • What should your test look like?
  • What is a realistic test?
  • And what can a test result tell you about the behavior of your production environment?

These are hard questions, and I often find that they are not asked. Or people are not aware that these questions should be asked.

This is the first post in a series of blog posts in which I want to dive a bit deeper into performance testing in the context of AEM and AEM CS (many aspects can probably be generalized to other web applications as well). Unlike my other blog posts, it addresses topics on a higher level (I will not refer to any AEM functionality or API, and won’t even mention AEM that often), because I have learned over time that performance tests are very often based on a lot of assumptions, and that it’s very hard to discuss the details of a performance test if these assumptions are not documented explicitly. I have had such discussions in these 2 contexts:

  • The result of a performance test (in AEM as a Cloud Service) is poor and the customer wants to understand what Adobe will do.
  • After go-live severe performance problems show up in production, and the customer wants to understand how this can happen, as their tests showed no problems.

As you can imagine, if I am given just a few diagrams with test results and test statistics as preparation for this call with the customer … this is not enough, and very often more documentation about the tests is not available. Which often leads to a lot of discussions about some very basic things and that adds even more delay to an already late project and/or bad customer experience. So you can also consider this blog series as a kind of self-defense. If you were asked to read this post, now you know 🙂

I hope that this series will also help you improve your way of doing performance tests, so we all will have less of these situations to deal with.

This post series consists of these individual posts:

And a word upfront about the term “performance test”: I summarize a number of different test types under that term, which are executed with different intentions and come with many names: “performance tests”, “load tests”, “stress tests”, “endurance tests”, “soak tests”, and more. Their intention and execution differ, but in the end they can all benefit from the same questions which I want to cover in this blog series. So if you read “performance test”, all of these other tests are meant as well.

What is a performance test? And why do we do them?

A performance test is a tool to predict the future, more specifically how a certain system will behave in a more-or-less defined scenario.

And that already outlines two problems which performance tests have.

  • It is a prediction of the future. Unlike a science experiment, it does not try to understand the present and extrapolate into the future. It does not have the same quality as “tomorrow we will have a sunrise, even if the weather is cloudy”, but rather goes in the direction of “if my 17-year-old son wants to celebrate his birthday party with his friends at home, we better plan a full cleaning of the house for the day after”. That means no matter how well you know your son and his friends (or the system you are building), there is still an element of surprise and the unknown in it.
  • The scenario which we want to simulate is somehow “defined”. In quotes, because in many cases the definition of that scenario is pretty vague. We normally base these definitions on previous experience and some best practices of the industry.

So it’s already clear from these 2 items that this prediction is unlikely to be exact and 100% accurate. But it does not need to be accurate; it just needs to be helpful.

A performance test is helpful if it delivers better results than our gut feeling; and the industry has learned that our gut feeling is totally unreliable when it comes to the behaviour of web applications under production load. That’s why many enterprise go-live procedures require a performance test, which will always deliver a more reliable result than gut feeling. But just creating and executing a performance test does not make it a helpful performance test.

So a helpful performance test is one which mimics reality closely enough that you don’t need to change your plans immediately after your system goes live and hits reality. Unfortunately, you only know whether your performance test was helpful after you went live. It shares this trait with other test approaches; for example, 100% unit test coverage does not mean that your code has no bugs, it’s just less likely.

What does that mean for performance tests and their design?

First, a performance test is based on a mental model of your system and the to-be reality, and this model must be documented. All its assumptions and goals should be documented explicitly, because only then can a review be done. And a review helps to uncover blind spots in our own mental model of the system, its environment and the way it is used. It helps to clearly outline all known factors which influence the test execution and its result.

Without that model, it is impossible to compare the test result with reality and to understand which factor or aspect of the test was missing, misrepresented or not fully understood, and led to a gap between test result and reality. If you don’t have a documented model, it’s possible to question everything, from the model to the correct test execution and the results. If you don’t have a model, the result of a performance test is just a PDF with little to no meaning.

Also, you must be aware that this mental model is a massive simplification, as it is impossible to factor in all aspects of reality, not least because reality changes every day. You will change your application, new releases of AEM as a Cloud Service will be deployed, you add more content, and so on.

Your mental model will never be complete and probably never up-to-date, and that will be reflected in your performance test. But if you know that, you can factor it in. For example, you know that in 3 months the amount of content will have doubled, and you can decide whether it’s helpful to redo the performance test with changed parameters. It’s now a “known unknown”, and no longer an “unknown unknown”. You can even decide to ignore factors if they do not seem relevant to you, but of course you should document that.

When you have designed and documented such a model, it is much easier to implement the test, execute it and reason about the results. Without such a model, there is much more uncertainty in every piece of the test execution. It’s like developing software without a clear and shared understanding of what exactly you want to develop.
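One lightweight way to make such a model explicit is to keep the assumptions as data next to the test code, so every run and every review can trace back to them. All values below are invented placeholders, not recommendations:

```python
model = {
    "concurrent_users": 500,
    "requests_per_user_per_minute": 6,
    "distinct_pages": 100_000,
    "cdn_ttl_hours": 8,
    "hotspot": "20% of pages receive 80% of requests",
    "expected_cdn_hit_ratio": 0.90,
    "target_p95_latency_ms": 500,
    # explicitly ignored factors ("known unknowns")
    "ignored": ["search engine crawlers", "content growth after go-live"],
}

# even simple derived numbers become reviewable:
rps = model["concurrent_users"] * model["requests_per_user_per_minute"] / 60
print(f"modelled request rate: {rps:.0f} rps")  # 50 rps
```

Whether this lives in Python, YAML or a wiki page matters less than the fact that it exists and is versioned together with the test.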

That’s enough for this post. As promised, it is more abstract than usual, but I hope you liked it and that it helps to improve your tests. In part 2 I look into a few relevant aspects which should be covered by your model.