AEM CS & Mongo exceptions

If you are an avid log checker on your AEM CS environments you might have come across messages like this in your authoring logs:

02.04.2024 13:37:42:1234 INFO [cluster-ClusterId{value='6628de4fc6c9efa', description='MongoConnection for Oak DocumentMK'}-cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net:27017] org.mongodb.driver.cluster Exception in monitor thread while connecting to server cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net:27017 com.mongodb.MongoSocketException: cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net 
at com.mongodb.ServerAddress.getSocketAddresses(ServerAddress.java:211) [org.mongodb.mongo-java-driver:3.12.7]
at com.mongodb.internal.connection.SocketStream.initializeSocket(SocketStream.java:75) [org.mongodb.mongo-java-driver:3.12.7]
...
Caused by: java.net.UnknownHostException: cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net

And you might wonder what is going on. I get this question every now and then, often with the assumption that something is wrong. After all, we have all learned that stacktraces normally indicate problems, and at first sight this one does indicate a problem: a specific hostname cannot be resolved. Is there a DNS problem in AEM CS?

Actually this message does not indicate any problem. The reason behind it is the way MongoDB implements scaling operations: if you up- or downscale the Mongo cluster, this does not happen in-place. Instead you get a new Mongo cluster of the new size, of course with the same content, and this new cluster comes with a new hostname.

So in this situation there was a scaling operation: AEM CS connected to the new cluster and then lost the connection to the old cluster, because the old cluster was stopped and its DNS entry removed. Which is of course expected, and for that reason the message is also logged at level INFO and not as an ERROR.

Unfortunately this log message is created by the Mongo driver itself, so it cannot be changed on the Oak level, neither by removing the stacktrace nor by rewording the message. For that reason you will continue to see it in the AEM CS logs, until an improved Mongo driver changes that.

Performance test modelling (part 5)

This is part 5 and the final post of the blog post series about performance test modelling; see part 1 for an overview and the links to all articles of this series.

In the previous posts I discussed the impact of the system which we test, how the modelling of the test and of the test content influences the result of the performance test, and how you implement the most basic scenario of a performance test.

In this blog post I want to discuss the predicted result of a performance test versus its actual outcome, and what you can do when the two do not match (actually, they rarely do on the first execution). I also want to discuss the situation where you find after go-live that the performance test delivered the expected results, but did not match the observed behavior in production.

The performance test does not match the expected results

In my experience every performance test, no matter how good or bad the basic definition is, contains at least 2 relevant data points:

  1. the number of concurrent users (we discussed that already in part 1)
  2. and an expected result, for example that the transaction must be completed within N seconds.

What if you don’t meet the performance criteria of point 2? This is typically the moment when customers on AEM as a Cloud Service start raising questions to Adobe about the number of pods, hardware details etc., as if the problem can only be the hardware sizing of the backend. If you don’t have a clear understanding of all the implications and details of your performance tests, this often seems to be the most natural thing to ask.

But if you have built a good model for your performance test, your first task should be to compare the assumptions with the results. Do you have the expected cache-hit ratio on the CDN? Were some assumptions in the model overly optimistic or pessimistic? As you now have actual data to validate your assumptions, you should do exactly that: go through your list of assumptions and check each one of them. Refine them. And when you have done that, modify the test and start another execution.
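
To make this concrete: computing the actual cache-hit ratio from a CDN log export is a small exercise. Here is a minimal Java sketch, assuming a hypothetical log format in which every request line carries a field like “cache=HIT”, “cache=MISS” or “cache=PASS” (your CDN’s actual log format will differ):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CacheHitRatio {

  public static void main(String[] args) throws IOException {
    long hits = 0;
    long total = 0;
    // hypothetical export: one request per line, carrying "cache=HIT", "cache=MISS" or "cache=PASS"
    for (String line : Files.readAllLines(Path.of("cdn-log.txt"))) {
      total++;
      if (line.contains("cache=HIT")) {
        hits++;
      }
    }
    double ratio = total == 0 ? 0.0 : (double) hits / total;
    System.out.printf("cache-hit ratio: %.2f%% (%d of %d requests)%n", ratio * 100, hits, total);
  }
}

If the measured ratio is far away from the one assumed in your model, that single number already tells you where to look first.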

And at some point you might come to the conclusion that all assumptions are correct and you have the expected cache-hit ratio, but the latency of the cache misses is too high (in which case the required action is performance tuning of individual requests). Or you have already reduced the cache MISSES (and cache PASSES) to the minimum possible and the backend is still not able to handle the load (in which case the expected outcome is an upscale); or it can be both.

That’s fine, and then it’s perfect to talk to Adobe, and share your test model, execution plan and the results. I wrote in part 1:

As you can imagine, if I am given just a few diagrams with test results and test statistics as preparation for this call with the customer … this is not enough, and very often more documentation about the test is not available. Which often leads to a lot of discussions about some very basic things and that adds even more delay to an already late project and/or bad customer experience.

But in this situation, when you have a good test model and have done your homework already, it’s possible to directly have a meaningful discussion without the need to uncover all the hidden assumptions first. Also, if you have that model at hand, I assume that performance tests were not an afterthought, and that there are still reasonable options to make changes which will either completely fix the situation or at least remediate the worst symptoms, without impacting the go-live and the go-live date too much.

So while this is definitely not the outcome we all work, design, build and ultimately hope for, it’s still much better than the second scenario below.

I hope that I don’t need to talk about unrealistic expectations in your performance tests, for example demanding a p99.9 of 200 ms latency while at the same time requiring that a good number of requests is always handled by the AEM backend. You should have detected such unrealistic assumptions much earlier, ideally during design and at the latest in the first runs during the evolution phase of your test.

Scenario 2: After go-live the performance is not what it’s supposed to be

In this scenario a performance test was either not done at all (don’t blame me for it!) or the test passed, but its results did not match the observed reality. This often shows up as outages in production or unbearable performance for users. It is the worst-case scenario, because everyone assumed the contrary, as the performance test results were green. Neither the business nor the development team is prepared for it, and there is no time for any mitigation. This normally leads to an escalated situation with conference calls, involvement from Adobe, and in general a lot of stress for all parties.

The entire focus is on mitigation, and we (I am speaking now as a member of the Adobe team, which is often involved in such situations) will try to do everything to mitigate the situation by implementing workarounds. As in many cases the most visible bottleneck is on the backend side, upscaling the backend is indeed the first task, and often this helps to buy some time to perform other changes. But there are even cases where an upscale of 1000% would be required to somehow mitigate the situation (which is possible, but also very short-lived, as every traffic spike on top will require an additional 500% …); also, it’s impossible to speed up a single-threaded request of 20 seconds by adding more CPU. These cases are not easy to solve; the workaround often takes quite some time and is often very tailored, and there are cases where a workaround is not even possible. In any case it’s normally not a nice experience for any of the involved parties.

I refer to all of these actions as “workarounds“. In bold. Because they are not the solution to the challenge of performance problems. They cannot be a solution, because this situation proves that the performance test was testing some scenarios, but not the scenario which shows up in the production environment. It also raises valid concerns about the reliability of other aspects of the performance tests, and especially about the underlying assumptions. Anyway, we all try to do our best to get the system back on track.

As soon as the workarounds are in place and the situation is somehow mitigated, 2 types of questions will come up:

  1. What does a long-term solution look like?
  2. Why did that happen? What was wrong with the performance test and the test results?

While the response to (1) is very specific (and definitely out of scope for this blog post), the response to (2) is interesting. If you have a well-documented performance test model, you can compare its assumptions with the situation in which the production performance problem happened. You have the chance to spot the incorrect or missing assumption, adjust your model and then the performance test itself. And with that you should be able to reproduce your production issue in a performance test!

And if you have a failing performance test, it’s much easier to fix the system and your application, and to apply specific changes which make this failed test pass. It also gives you much more confidence that you changed the right things to make the production environment handle the same situation in a much better way. Interestingly, this also answers question (1) to a large extent.

If you don’t have such a model in this situation, you are badly off. Because then you either start building the performance test model and the performance test from scratch (which takes quite some time), or you switch to the “let’s test our improvements in production” mode. Most often the production testing approach is used (along with some basic testing on stage to avoid making the situation worse), but even that takes time and a high number of production deployments. While you can call it agile, others might call it chaos and hoping for the best … the actual opposite of good engineering practice.

Summary

In summary, when you have a performance test model, you are likely to have fewer problems when your system goes live. Mostly because you have invested time and thought in that topic, and because you acted on it. It will not prevent you from making mistakes, forgetting relevant aspects and such; but if that happens, you have a good basis to understand the problem quickly and a good foundation to solve it.

I hope that you learned in these posts some aspects of performance tests which will help you improve your test approach and test design, so that you ultimately have fewer unexpected problems with performance. And if you have fewer problems, my life in the AEM CS engineering team is much easier 🙂

Thanks for staying with me throughout this first planned series of blog posts. It was a bit experimental, and the structure this topic required led to some interesting additions to the overall outline (the first outline covered just 3 posts, now we are at 5). But I think that even that is not enough; some aspects deserve a blog post of their own.

Performance test modelling (part 4)

This is the 4th post of the blog post series about performance test modelling; see part 1 for an overview and the links to all articles of this series.

In the parts 2 and 3 I outlined relevant aspects when it comes to model your performance tests:

  • The modelling of the expected load, often expressed as “concurrent users”.
  • The realistic modelling of the system where we want to conduct the performance tests, mostly regarding the relevant content and data.

In this blog post I want to show how you deduce from that data which specific scenarios you should cover with your performance tests. Because there is no single test which tells you whether the resulting end-user performance is good or not.

The basic performance test scenario

Let’s start with a very simple model, where we assume that the traffic rate is roughly constant over the whole day; and therefore the performance test resembles that model:

At first sight this is quite simple to model, because your performance test will execute requests at a constant rate for the whole period of time.

But as I outlined in part 3, even if it seems that simple, you have to include at least some background noise. You also have to take into account that the cache-hit ratio is poor at the beginning, so you have to implement a cache-warmup phase (normally implemented as a ramp-up phase, in which the load increases up to the planned plateau) and only start measuring from there.

So our revised plan rather looks like this:

Such a test execution (with the proper modelling of users, requests and requested data) can give you pretty good results if your model assumes a pretty constant load.

What if your model requires a much more fluctuating request rate (for example if your users/visitors are primarily located in North America, and during the night you have almost no traffic, but it starts to increase heavily in the American morning hours)? In that case you would probably model the warmup in a way that resembles the morning increase in traffic, both in frequency and rate. That shouldn’t be that hard, but it requires a bit more explicit modelling than a simple ramp-up.
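
To illustrate what “explicit modelling” can mean here: the target request rate can be expressed as a simple function of the elapsed test time, which the load injector then follows. A minimal Java sketch, with made-up numbers (your model provides the real plateau and ramp-up values):

public class LoadProfile {

  // target requests per second as a function of the elapsed test time:
  // a 30-minute linear ramp-up (cache warming, backend upscaling),
  // followed by a constant plateau at 100% of the modelled load
  static double targetRps(long elapsedMinutes) {
    double plateauRps = 200.0; // made-up; your model provides the real number
    long rampUpMinutes = 30;
    if (elapsedMinutes < rampUpMinutes) {
      return plateauRps * elapsedMinutes / rampUpMinutes;
    }
    return plateauRps;
  }

  public static void main(String[] args) {
    for (long minute = 0; minute <= 120; minute += 15) {
      System.out.printf("minute %3d -> %.0f req/s%n", minute, targetRps(minute));
    }
  }
}

For a fluctuating profile you would replace the constant plateau with a curve derived from your model, for example the expected morning increase in traffic.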

To give you some practical hints towards some basic parameters:

  • Such a performance test should run for at least 2-3 hours; and even if you see early that the results are not what you expect, not terminating it can reveal interesting results.
  • The warmup phase should cover at least 30 minutes; not only to give the caches time to warm up, but also to give the backend systems time to scale to their “production sizing”; when you don’t execute performance tests all the time, the system might scale down, because there is no sense in having many systems idle when there is no load.
  • It can make sense not to start with 100% of the targeted load, but with smaller numbers, and increase from there. Because only then can you see which bottleneck your test hits first. If you start right away with 100%, you might just see a lot of blocked requests, but you don’t know which bottleneck is the most limiting one.
  • When you are implementing a performance test in the context of AEM as a Cloud Service, I recommend to also use my checklist for performance testing on AEM CS, which gives some more practical hints on how to get your tests right; although a few aspects covered there are covered in more depth in this post series as well.

When you have such a test passing, the biggest part of the work is done; and based on your models you can execute a number of different tests to answer more questions.

Variations of the basic performance test

The above model covers just a totally average day. But of course it’s possible to vary the created scenario to answer some more questions:

  • What happens if the load of the day is not 100%, but for some reason 120%, with identical assumptions about user behavior and traffic distribution? That’s quite simple, because you just increase a number in the performance test.
  • The basic performance test runs just for a few hours and then stops. It gives you the confidence that the system can operate for at least these many hours, but a few issues might go unnoticed. For example memory leaks accumulating over time might only become visible after many hours of load. For that reason it makes sense to run your test for 24-48 hours continuously to validate that there is no degradation over that time.
  • What’s the behavior when the system goes into overload? An interesting question (but only if the system does not already break when hitting the anticipated load), which is normally answered by a break test; there you increase the load more and more, until the situation really gets out of hand. If you have enough time, that’s indeed something you can try, but let’s hope it’s not very relevant 🙂
  • How does the system behave when your backend systems are not available? What if they come online again?

And probably many more interesting scenarios you can think of. But you should only perform these when you have the basic test right.

When you have your performance tests passing, the question is still: how does it compare to production load? Are we actually testing the right things? In the next and last post of this series I will cover the options you have when the performance test does not deliver the expected results, and also the worst-case scenario: what happens if you find out after go-live that your performance tests were good, but the production environment behaves very differently?

CDN and dispatcher – 2 complementary caching layers

I sometimes hear the question how to implement cache invalidation for the CDN. Or why AEM CS still operates with a dispatcher layer when it now has a much more powerful CDN in front of it.

The questions are very different, but the answer is in both cases: the CDN is no replacement for the dispatcher, and the dispatcher does not replace the CDN. They serve different purposes, and the combination of the two can be a really good package. Let me explain.

The dispatcher is a very traditional cache. It fronts the AEM systems, and the cache status is actively maintained by cache invalidation, so it always delivers current data. But from an end-user perspective this cache is often far away in terms of network latency: if my AEM systems are hosted in Europe, and end-users from Australia reach them, the latency can get huge.

The CDN is the opposite: it serves the content from many locations across the world, as close to the end-user as possible. But CDN cache invalidation is cumbersome, and for that reason most often TTL-based expiration is used. That means you have to accept the chance that new content is already available, but the CDN still delivers old content.

Not everyone is happy with that; and if that’s a real concern, short TTLs (in the range of a few minutes) are the norm. That means that many files on the CDN get stale every few minutes, which results in cache misses; and a cache miss on the CDN goes back to origin. But of course the reality is that not many pages change every 10 minutes; actually very few do. But customers want that low TTL just in case a page was changed and that change needs to become visible to all end-users as soon as possible.

So you have a lot of cache misses on the CDN, which trigger a re-fetch of the file from origin; and because many of the files have not changed, you re-fetch exactly the same binary which went stale seconds ago. Actually a waste of resources, because your origin system delivers the same content over and over again to the CDN as a consequence of these misses. You could keep your AEM instances busy all the time, re-rendering the same requests over and over, always creating the same response.
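
A quick back-of-the-envelope calculation illustrates the scale of this effect: with TTL-based expiration, every file is re-fetched from origin at most once per TTL, so the origin request rate is bounded by the number of distinct files divided by the TTL, independent of the end-user traffic. A small Java sketch with made-up numbers:

public class OriginLoad {

  public static void main(String[] args) {
    long distinctFiles = 100_000; // made-up: number of distinct files served via the CDN
    long ttlSeconds = 10 * 60;    // a short TTL of 10 minutes
    // upper bound: every file expires and is re-fetched once per TTL window
    double maxOriginRps = distinctFiles / (double) ttlSeconds;
    System.out.printf("up to %.0f origin requests/s just to re-fetch (mostly unchanged) content%n", maxOriginRps);
  }
}

Those roughly 167 requests per second hit the origin no matter how good the CDN is; the question is only whether AEM re-renders each of them or the dispatcher answers most of them from its cache.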

Enter the dispatcher cache, fronting the actual AEM instance. If the file has not changed, the dispatcher will deliver the same file (or just an HTTP 304 Not Modified, which even avoids sending the content again). And it’s fast, much faster than letting AEM render the same content again. And if the file has actually changed, it’s rendered once and then reused for all future CDN cache misses.

The combination of these 2 caching approaches helps you to deliver content from the edge while at the same time having a reasonable latency for content updates (that means the time between replicating a change to the publish instances and all users across the world seeing it), without the need for a huge number of AEM instances in the background.

So as a conclusion: using the CDN and the dispatcher cache is a good combination, if set up properly.

Performance tests modelling (part 3)

This is post 3 in my series about Performance Test Modelling. See the first post for an overview of this topic.

In the previous 2 posts I discussed the importance of having a clearly defined model of the performance tests, and that a good definition of the load factors (typically measured in “concurrent users”) is required to build a realistic test.

In this post I cover the influence of the test system and test data on the performance test and its result, and why you should spend effort to create a test with a realistic set of data/content. We will do a few thought experiments, and to judge the results of each experiment, we will use the cache-hit ratio of a CDN as a proxy metric.

Let’s design a performance test for a very simple site: it consists of just 1 page, 5 images, 1 CSS and 1 JS file; 8 files in total. Plus there is a CDN in front of it. Let’s assume that we have to test with 100, 500 and 1000 concurrent users. What test result do you expect?

Well, easy: you will get the same test result for all tests, irrespective of the level of concurrency; mostly because after the first requests all files will be delivered from the CDN. That means no matter at what concurrency we test, the files are delivered from the CDN, which we assume will always deliver very fast. We do not test our system, but rather the CDN, because the cache-hit ratio is quite close to 100%.

So why do we do this test at all, knowing that it just validates the performance promises of the CDN vendor? There is no reason for it. The only reason why we would ever execute such a test is that during test design we did not pay attention to the data we use to test, and someone decided that these 8 files are enough to satisfy the constraints of the performance test. But the results do not tell us anything about the performance of the site, which in production will consist of tens of thousands of distinct files.

So let us do a second thought experiment. This time we test with 100’000 files, 100 concurrent users requesting these files randomly, and a CDN which is configured to cache files for 8 hours (TTL = 8h). With regard to the cache-hit ratio, what is the expectation?

We expect that the cache-hit ratio starts low and stays low for quite some time; this is the cache-warming phase. Then it starts to increase, but it will never hit 100%, as after some time cache entries will expire and start to produce cache misses. This is a much better model of reality, but it still has a major flaw: in reality, requests are not randomly distributed; normally there are hotspots.

A hotspot consists of files which are requested much more often than average. Normally these are homepages or other landing pages, plus other pages which users are typically directed to. This set of files is normally quite small compared to the total number of files (in the range of 1-2%), but it makes up 40-60% of the overall requests; you can easily assume a Pareto distribution (the famous 80/20 rule), where 20% of the files are responsible for 80% of the requests. That means we have a hotspot and a long-tail distribution of the requests.

If we modify the same performance test to take that distribution into account, we end up with a higher cache-hit ratio, because now the hotspot can be delivered mostly from the CDN. On the long tail we will have more cache misses, because these files are requested so rarely that they can expire on the CDN without being requested again. But in total the cache-hit ratio will be better than with the random distribution, especially on the often-requested pages (which are normally the ones we care about most).
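
These thought experiments are cheap to verify with a small simulation. The following Java sketch is one way to do that; all parameters are made up, and the hotspot is modelled as a crude two-segment distribution instead of a real Pareto distribution:

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class CdnSimulation {

  // simulate a TTL-based cache in front of 100'000 files (TTL = 8h)
  // and return the resulting cache-hit ratio
  static double simulate(boolean hotspot) {
    int files = 100_000;
    long ttlSeconds = 8 * 3600;
    long requests = 500_000;            // made-up: one request per second over several days
    Random rnd = new Random(42);
    Map<Integer, Long> cachedUntil = new HashMap<>();
    long hits = 0;
    for (long i = 0; i < requests; i++) {
      long now = i;                     // one request per second
      int file;
      if (hotspot && rnd.nextDouble() < 0.8) {
        file = rnd.nextInt(files / 50); // 80% of the requests hit a 2% hotspot
      } else {
        file = rnd.nextInt(files);      // the rest is spread over all files
      }
      Long expiry = cachedUntil.get(file);
      if (expiry != null && expiry > now) {
        hits++;                                  // still cached on the CDN
      } else {
        cachedUntil.put(file, now + ttlSeconds); // miss: fetch from origin, cache again
      }
    }
    return hits / (double) requests;
  }

  public static void main(String[] args) {
    System.out.printf("random requests:  %.0f%% cache hits%n", simulate(false) * 100);
    System.out.printf("hotspot requests: %.0f%% cache hits%n", simulate(true) * 100);
  }
}

With these (made-up) parameters the hotspot-based distribution yields a far better overall hit ratio than the uniform one, matching the reasoning above.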

Let’s translate this into a graph which displays the response time.

This test is now quite realistic, and if we only focus on the 95th percentile (p95; that means if we take 100 requests, 95 of them are faster than this value) the result would meet the criteria; but beyond that the response time gets higher, because there are a lot of cache misses.
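
As a side note: if you want to compute such a percentile yourself from raw measurements, the naive approach is to sort the samples and pick the value at the 95% rank; a minimal sketch with made-up response times:

import java.util.Arrays;

public class Percentile {

  // naive percentile: sort the samples and pick the value at the given rank
  static long percentile(long[] latenciesMillis, double p) {
    long[] sorted = latenciesMillis.clone();
    Arrays.sort(sorted);
    int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
    return sorted[Math.max(0, index)];
  }

  public static void main(String[] args) {
    long[] samples = {12, 15, 14, 13, 250, 16, 11, 12, 900, 14}; // made-up response times in ms
    System.out.println("p95: " + percentile(samples, 95) + " ms");
  }
}

Note how the few slow samples (the cache misses) dominate the high percentiles, which is exactly the effect described above.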

This level of realism in the test results comes at a price: the performance test model as well as the test preparation and execution are much more complicated now.

And until now we have only considered users; but what happens when we add random internet noise and the search engines (the unmodelled users discussed earlier in this series) to the scenario? These will add more (relative) weight to the long tail, because these requests do not necessarily follow the usual hotspots; rather, we have to assume a more random distribution for them.

That means the cache-hit ratio will be lower again, as there will be many more cache misses now; and of course this will also increase the p95 response time. And: it will complicate the model even further.

So let’s stop here. As I outlined above, the simplest model is totally unrealistic, but making it more realistic makes the model more complex as well. And at some point the model is no longer helpful, because we cannot transform it into a test setup without excessive effort (creating test data/content, complex rules to implement the random and hotspot-based requests, etc.). That means, especially in the case of test data and test scenarios, we need to find the right balance between the investment we want to make into tests and how closely they should mirror reality.

I also tried to show you how far you can get without doing any kind of performance test. Just based on some assumptions we were able to build a basic understanding of how the system will behave, and how some changes of the parameters will affect the result. I use this technique a lot, and it helps me to quickly refine models and define the next steps or the next test iteration.

In the next post I will discuss various scenarios which you should consider in your performance test model, including some practical recommendations on how to include them.

Performance tests modelling (part 2)

This is the second blog post in the series about performance test modelling. You can find the overview of this series and links to all its articles in the post “Performance tests modelling (part 1)“.

In this blog post I want to cover the aspect of “concurrent users”: what it means in the context of a performance test and why it’s important to clearly understand its impact.

Concurrent users is an often-used measure to indicate the load put on a system, expressed as the number of users concurrently using that system. For that reason many performance tests provide a quantitative requirement such as: “The system should be able to handle 200 concurrent users”. While that seems to be a good definition at first sight, it leaves many questions open:

  • What does “concurrent” mean?
  • And what does “user” mean?
  • Are “200 concurrent users” enough?
  • Do we always have “200 concurrent users”?

Definition of concurrent

Let’s start with the first question: What does “concurrent” really mean on a technical level? How can we measure that our test indeed simulates “200 concurrent users” and not just 20 or 1000?

  • Are there any server-side sessions which we can count and which directly give us this number? And do we set up our test in a way to hit that number?
  • Or do we have to rely on vaguer definitions like “users are considered concurrent when they do a page load less than 5 minutes apart”? And do we design our test accordingly?

Actually it does not matter at all which definition you choose. It’s just important that you explicitly state which definition you use, and what metric you choose to verify that you hit that number. This is an important definition when it comes to implementing your test.

And as a side note: many commercial tools have their own definition of concurrency; here the exact definition does not matter either, as long as you are able to articulate it.
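
To show how such a definition becomes measurable, here is a minimal Java sketch which counts “concurrent users” according to the second definition above (a user counts as concurrent if their last page load was less than 5 minutes ago); the event feed is hypothetical, in reality you would parse it from an access log:

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class ConcurrentUsers {

  // last page load per user, e.g. parsed from an access log
  static final Map<String, Instant> LAST_SEEN = new HashMap<>();

  static void pageLoad(String userId, Instant when) {
    LAST_SEEN.put(userId, when);
  }

  // a user counts as "concurrent" if their last page load is less than 5 minutes ago
  static long concurrentUsers(Instant now) {
    Duration window = Duration.ofMinutes(5);
    return LAST_SEEN.values().stream()
        .filter(t -> Duration.between(t, now).compareTo(window) < 0)
        .count();
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    pageLoad("alice", now.minusSeconds(60));  // active one minute ago
    pageLoad("bob", now.minusSeconds(600));   // idle for 10 minutes
    System.out.println(concurrentUsers(now)); // prints 1
  }
}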

What is a user?

The next question is about “the user” which is modeled in the test; to simplify the test and its executions, one or more “typical” user personas are created, which visit the site and perform some actions. Which is definitely helpful, but it’s just that: a simplification, because otherwise our model would explode from the sheer complexity and variety of user behavior. Also, sometimes we don’t even know what a typical “user” does on our site, because the system is brand-new.

So this is a case where we have a huge variance in the behavior of the users, which we should flag in our model as a risk: the model is only valid if the majority of the users behave more or less as we assumed.

But is this all? Do all users really perform at least 10% of the actions we assume they do?

Let’s brainstorm a bit and try to find answers for these questions:

  • Does the Google bot behave like that? And all the other search engine bots?
  • What about malware scanners which try to hit a huge list of WordPress/Drupal/… URLs on your site?
  • Other systems performing (random?) requests towards your site?

You could argue that this traffic has less/no business value, and for that reason we don’t test for it. It could also be assumed that this is just a small fraction of the overall traffic and can be ignored. But that is just an assumption, and nothing more. You just assume that it is irrelevant. But often these requests are not irrelevant, not at all.

I encountered cases where it was not the “normal users” bringing down a system, but rather this non-normal type of “user”. An example are cases where the custom 404 handler was very slow, and for that reason the basic, undocumented assumption “We don’t need to care about 404s, as they are very fast” was violated and brought down the site. All performance tests passed, but the production system failed nevertheless.

So you need to think about “user” in a very broad sense. And even if you don’t implement the constant background noise of the internet in your performance test, you should list it as a factor. If you know that a lot of this background noise will trigger an HTTP status code 404, you are more likely to check that this 404 handler is fast.
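
Validating such an assumption is cheap. A minimal sketch using the JDK HTTP client which measures the latency of the 404 path directly (the URL is a placeholder for a path which is guaranteed not to exist on your site):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NotFoundLatency {

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    // placeholder URL: a path which does not exist on your site
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("https://www.example.com/wp-login.php")).build();
    long start = System.nanoTime();
    HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
    long millis = (System.nanoTime() - start) / 1_000_000;
    System.out.println("status " + response.statusCode() + " in " + millis + " ms");
  }
}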

Are “200 concurrent users” enough?

One piece of information every performance test has is the number of concurrent users the system must be able to handle. But even if we assume that “concurrent” and “users” are both well defined, is this enough?

First, what data is this number based on? Is it based on data derived from another system which the new system should replace? That’s probably the best data you can get. Or, when you build a new system, is it based on good marketing data (which would be okay-ish), on assumptions about the expected usage, or just on numbers we would like to see (because we assume that a huge number of concurrent users means a large audience and a high business value)?

So this is probably the topic which will be discussed the most. The number and the way it is determined should be challenged and vetted, because it’s one of the cornerstones of the whole performance test model. It does not make sense to build a high-performance, scalable system only to find out afterwards that the business numbers were grossly overrated, and a smaller and cheaper solution would have delivered the same results.

What about time?

An important aspect which is often overlooked is timing: how many users are working on the site at any given moment? Do you need to expect the maximum number 8 hours every day, or just during the peak days of the year? Do you have a more or less constant usage, or usage only during business hours in Europe?

This heavily depends on the type of your application and the distribution of your audience. If you build an intranet site for a company located only in Europe, the usage during the night is pretty much zero; it will start to increase at 06:00 in the morning (probably the Germans going to work early :-)), hit the maximum usage between 09:00 and 16:00 and go back to zero at 22:00 at the latest. The contrast to this is a site visited world-wide by customers, where we can expect a higher and almost flat line; of course with variations depending on the number of people being awake.

This influences your tests as well, because in both cases you don’t need to simulate spikes, meaning a 500% increase of users within 5 minutes. On the other hand, if you plan large marketing campaigns addressing millions of users, this might be exactly the situation you need to plan and test for. Not to mention if you book a slot during the Super Bowl break.

Why is this important? Because you only need to test scenarios which you expect to see in production, and you can ignore scenarios which don’t have any value for you. For example, it’s a waste of time and money to test for a sudden spike in the above-mentioned intranet case of the European company, while for marketing campaigns it’s essential to test a scenario where such a spike comes on top of the normal traffic.

Summary

“N concurrent users” by itself is not much information; while it can serve as input, your performance test model should contain a more detailed understanding of that definition and what it means for the performance test. Otherwise you will focus on just a given number of users of this idealized type and ignore every other scenario and case.

In the next blog post I will cover how the system and the test data itself will influence the result of the performance test.

Performance tests modelling (part 1)

In my last blog post about performance testing I outlined best practices for building and executing a performance test with AEM as a Cloud Service. But I intentionally left out a huge aspect of the topic:

  • What should your test look like?
  • What is a realistic test?
  • And what can a test result tell you about the behavior of your production environment?

These are hard questions, and I often find that they are not asked. Or people are not aware that these questions should be asked.

This is the first post in a series of blog posts in which I want to dive a bit deeper into performance testing in the context of AEM and AEM CS (many aspects can probably be generalized to other web applications as well). Unlike my other blog posts it addresses topics on a higher level (I will not refer to any AEM functionality or API, and won’t even mention AEM that often), because I have learned over time that very often performance tests are done based on a lot of assumptions, and that it’s very hard to discuss the details of a performance test if these assumptions are not documented explicitly. I have had such discussions in these 2 contexts:

  • The result of a performance test (in AEM as a Cloud Service) is poor and the customer wants to understand what Adobe will do.
  • After go-live, severe performance problems show up in production, and the customer wants to understand how this can happen, as their tests showed no problems.

As you can imagine, if I am given just a few diagrams with test results and test statistics as preparation for this call with the customer … this is not enough, and very often more documentation about the tests is not available. Which often leads to a lot of discussions about some very basic things and that adds even more delay to an already late project and/or bad customer experience. So you can also consider this blog series as a kind of self-defense. If you were asked to read this post, now you know 🙂

I hope that this series will also help you improve your way of doing performance tests, so we all will have fewer of these situations to deal with.

This post series consists of these individual posts:

  • Performance tests modelling (part 1)
  • Performance tests modelling (part 2)
  • Performance tests modelling (part 3)
  • Performance test modelling (part 4)
  • Performance test modelling (part 5)

And a word upfront about the term “performance test”: I summarize under that term a number of different test types, which are executed with different intentions and come with many names: “performance tests”, “load tests”, “stress tests”, “endurance tests”, “soak tests”, and many more. Their intention and execution differ, but in the end they can all benefit from the same questions which I want to cover in this blog series. So if you read “performance test”, all of these other tests are meant as well.

What is a performance test? And why do we do them?

A performance test is a tool to predict the future, more specifically how a certain system will behave in a more-or-less defined scenario.

And that already outlines two problems which performance tests have:

  • It is a prediction of the future. Unlike a science experiment it does not try to understand the present and extrapolate into the future. It does not have the same quality as “tomorrow we will have a sunrise, even if the weather is cloudy”, but rather goes in the direction of “if my 17-year-old son wants to celebrate his birthday party with his friends at home, we better plan a full cleaning of the house for the day after”. That means no matter how well you know your son and his friends (or the system you are building), there is still an element of surprise and unknown in it.
  • The scenario which we want to simulate is only somewhat “defined”. In quotes, because in many cases the definitions of that scenario are pretty vague. We normally base these definitions on previous experience and some industry best practices.

So it’s already clear from these 2 items that this prediction is unlikely to be exact and 100% accurate. But it does not need to be accurate, it just needs to be helpful.

A performance test is helpful if it delivers better results than our gut feeling; and the industry has learned that our gut feeling is totally unreliable when it comes to the behaviour of web applications under production load. That’s why many enterprise go-live procedures require a performance test, which will always deliver a more reliable result than gut feeling. But just creating and executing a performance test does not make it a helpful performance test.

So a helpful performance test is one which mimics reality closely enough that you don’t need to change your plans immediately after your system goes live and hits reality. Unfortunately you only know whether your performance test was helpful after you went live. It shares this property with other test approaches; for example, 100% unit test coverage does not mean that your code has no bugs, it’s just less likely.

What does that mean for performance tests and their design?

First, a performance test is based on a mental model of your system and the to-be reality, and this model must be documented. All its assumptions and goals should be documented explicitly, because only then can a review be done. And a review helps to uncover blind spots in our own mental model of the system, its environment and the way it is used. It helps to clearly outline all known factors which influence the test execution and also its result.

Without that model it is impossible to compare the test result with reality and to understand which factor or aspect of the test was missing, misrepresented or not fully understood, and led to a gap between test result and reality. If you don’t have a documented model, it’s possible to question everything, from the model to the correct test execution and the results. If you don’t have a model, the result of a performance test is just a PDF with little to no meaning.

You must also be aware that this mental model is a massive simplification, as it is impossible to factor in all aspects of reality, not least because reality changes every day. You will change your application, new releases of AEM as a Cloud Service will be deployed, you add more content, and so on.

Your mental model will never be complete and probably never be up to date, and that will be reflected in your performance test. But if you know that, you can factor it in. For example, you know that in 3 months the amount of content will have doubled, and you can decide whether it’s helpful to redo the performance test with changed parameters. It’s now a “known unknown”, and no longer an “unknown unknown”. You can even decide to ignore factors if they do not seem relevant to you, but of course you should document that.

When you have designed and documented such a model, it is much easier to implement the test, execute it and reason about the results. Without such a model, there is much more uncertainty in every piece of the test execution. It’s like developing software without a clear and shared understanding of what exactly you want to develop.

That’s enough for this post. As promised, this is more abstract than usual, but I hope you liked it and it helps to improve your tests. In the next blog posts I will look into a few relevant aspects which should be covered by your model.

Sling Model Exporter & exposing ResourceResolver information

Welcome to 2024. I will start this new year with a small piece of advice regarding Sling Models, which I hope you can implement very easily on your side.

The Sling Model Exporter is based on the Jackson framework, and it can serialize an object graph, with the root being the requested Sling Model. For that it recursively serializes all public & protected members and the return values of all simple getters. Properly modeled this works quite well, but small errors can have large consequences. While missing data is often quite obvious (if the JSON powers an SPA, you will find it not working properly), too much data being serialized is spotted less frequently (normally not at all).

I am currently exploring options to improve performance, and I am a big fan of the ResourceResolver.getPropertyMap() API to implement a per-ResourceResolver cache. While testing such a potential improvement I found customer code in which the ResourceResolver is serialized via the Sling Model Exporter into JSON. In that case the code looked like this:

import javax.annotation.PostConstruct;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.Self;

@Model(adaptables = Resource.class)
public class MyModel {

  @Self
  Resource resource;

  // the resolver is kept as a member and ends up in the serialized object graph
  ResourceResolver resolver;

  @PostConstruct
  public void init() {
    resolver = resource.getResourceResolver();
  }
}

(see this good overview at Baeldung of the default serialization rules of Jackson.)

And that’s bad in 2 different aspects:

  • Security: The serialized ResourceResolver object contains, in addition to the data returned by the public getters (e.g. the search paths, userId and potentially other interesting data), also the complete propertyMap. And this serialized cache is probably nothing you want to expose to the consumers of this JSON.
  • Exceptions: If the getPropertyMap() cache contains instances of classes which are not publicly exposed (that means these class definitions are hidden within some implementation packages), you will encounter ClassNotFoundExceptions during serialization, which will break the export. And instead of JSON you get an internal server error or a partially serialized object graph.

In short: it is not a good idea to serialize a ResourceResolver. And honestly, I have not found a reason why this should be possible at all. So right now I am a bit hesitant to use the propertyMap as a cache, especially in contexts where the Sling Model Exporter might be used. And that blocks me from working on some interesting performance improvements 😦

To unblock this situation, we have introduced a 2-step mechanism which should help to overcome it:

  1. In the latest AEM as a Cloud Service release 14697 (both in the cloud as well as in the SDK) a new WARN message has been added when your model definition causes a ResourceResolver to be serialized. Search the logs for this message: “org.apache.sling.models.jacksonexporter.impl.JacksonExporter A ResourceResolver is serialized with all its private fields containing implementation details you should not disclose. Please review your Sling Model implementation(s) and remove all public accessors to a ResourceResolver.”
    It should also contain a reference to the request path where this is happening, so it should be easy to identify the Sling Model class which triggers this serialization and to change that piece of code so the ResourceResolver is not serialized anymore. Note that the above message is just a warning; the behavior remains unchanged.
  2. As a second measure, functionality has been implemented which allows blocking the serialization of ResourceResolvers via the Sling Model Exporter completely. Enabling this is a breaking change for all AEM as a Cloud Service customers (even if I am 99.999% sure that it won’t break any functionality), and for that reason we cannot enable it on the spot. But at some point this step is necessary to guarantee that the 2 problems listed above can never happen.

Right now the first step is enabled, and you will see this log message. If you do, I encourage you to adapt your code (the Core Components should be safe) so that ResourceResolvers are no longer serialized.
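
How such an adaptation can look depends on your code; one possible sketch (under the assumption that your business logic still needs the resolver internally) is to keep the field private and, if a getter must exist, to exclude it from serialization with Jackson’s @JsonIgnore:

import javax.annotation.PostConstruct;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.Self;

import com.fasterxml.jackson.annotation.JsonIgnore;

@Model(adaptables = Resource.class)
public class MyModel {

  @Self
  private Resource resource;

  // private and only used internally
  private ResourceResolver resolver;

  @PostConstruct
  public void init() {
    resolver = resource.getResourceResolver();
  }

  @JsonIgnore // even if a getter is needed, keep the resolver out of the JSON
  public ResourceResolver getResolver() {
    return resolver;
  }
}

The important part is that neither a public field nor an unannotated public getter exposes the ResourceResolver to the exporter anymore.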

In parallel we need to implement step 2; right now the planning is not done yet, but I hope to activate step 2 some time later in 2024 (not before the middle of the year). Before this is done, there will be formal announcements in the AEM release notes. And I hope that with this blog post and the release notes all customers will have adapted their implementation, so that flipping this switch will not change anything.

Update (January 19, 2024): There is now a piece of official AEM documentation covering this situation as well.

A review of 2023

It’s December again, and so it is time to review my activities of 2023 in this blog.

I have to admit, I am not a reliable writer, as I write very infrequently. And it’s not because of a lack of time, but rather because I rarely find content which (in my opinion) is worth writing about. I don’t have large topics which I split up into a series of posts; if you ever saw a smaller series of posts, that mostly happened by accident. I was just working on aspects of the system, which at some point I wrote about and afterwards started to understand more. That was the default for the last 15 years … (OMG, have I really been blogging here for that long? Probably; the first post “Why use the dispatcher?” went live on December 22, 2008. So this is the 15th anniversary.)

I started 2023 with 4 posts until the end of September:

But something changed in October. I had already prepared 2 postings, but then I started to come up with more topics within days; it ended with 6 blog posts in October and November, which is quite a pace for this blog:

It felt incredible to be able to announce a new blog post every few days. I don’t think that I can keep up that frequency, but I will try to write more often in 2024. I have already noted a few topics for the next posts, stay tuned 🙂

Also, if you are reading this blog because you found a link to it somewhere, and you are interested in the topics I write about: you can get notified of new posts immediately by providing me (well, WordPress) your email address (you should see the form on the right rail of this page). Alternatively, if you are old-style, you can also subscribe to the RSS feed of this blog, which also contains the full content of the postings. That might be interesting for you, as I normally reference new posts on other channels with some delay, and sometimes I even skip that completely (or simply forget).

Thanks for your attention, and I wish you all a successful and happy 2024.

Thoughts on performance testing on AEM CS

Performance is an interesting topic on its own, and I have already written a bit about it in this blog (see the overview). But I have not yet written about performance testing in the context of AEM CS. It’s not that it is fundamentally different, but there are some specifics which you should be aware of.

  • Perform your performance tests on the Stage environment. The Stage environment is kept at the same sizing as the production environment, so it should deliver the same behavior as your PROD environment, if you have the same content and your test is realistic.
  • Use a warmup phase. As the Stage environment is normally downscaled to the minimum (because there is no regular traffic), it can take a bit of time until it has upscaled (automatically) to the same number of instances your PROD is normally operating with. That means that your test should have a proper warmup period, during which you increase the traffic to the normal 100% level of production. This warmup phase should take at least 30 minutes.
  • I think that any test should take at least 60-90 minutes (including warmup); even if you see early that the result is not what you expect it to be, there is often something to learn even from such incorrect/faulty situations. I had a case where a customer constantly terminated the test after about 20-25 minutes, claiming that something server-side was not working as they expected. Unfortunately the situation had not yet settled, so I was not able to get any useful information from the system.
  • AEM CS comes with a CDN bundled with the environment, and that’s the case for the Stage environment as well. That also means that your performance test should contain all requests which you expect to be delivered from the CDN. This is important because it can show whether your caching is working as intended. Also, only then can you assess the impact of the cache misses (when files expire on the CDN) on the overall performance.
  • While you are at it, you can run a stage pipeline during the performance test and deploy new code. You should not see any significant change in performance during that time.
  • Oh yes, also do some content activations during that time. That makes your test much more realistic and also reveals potential performance problems when updating content (e.g. because you constantly invalidate the complete dispatcher cache).
  • You should focus on a large content set when you do the performance test. If you only test a handful of pages/assets/files, you are mostly testing caches (at all levels).
  • “Campaign traffic” is rarely tested. This is traffic which has some query strings attached (e.g. “utm_source”, “gclid” and such) to support traffic attribution. These parameters are ignored during rendering, but they often bypass all caching layers, hitting AEM directly. And while a regular performance test only tests without these parameters, if your marketing department runs a Facebook campaign, the traffic from that campaign looks much different, and then the results of your performance tests are not valid anymore.

Some words as precaution:

  • A performance test can look like a DoS attack, and your requests can get blocked for that reason. This can happen especially if the requests originate from a single source IP. For that reason you should distribute your load injectors and use multiple source IP addresses. In case you still get blocked, please contact support so we can adapt accordingly.
  • AEM CS uses an affinity cookie to indicate that requests of a user-agent are handled by a specific backend instance. If you use the same affinity cookie throughout all your performance tests, you test just a single backend instance; that effectively disables any load balancing and renders the performance test results unusable. Make sure that you design your performance tests with that in mind.

In general I much prefer helping you during the performance testing phase over handling escalations because of bad performance and potential outages caused by it. I hope that you think the same way.