Do not use AEM as a proxy for backend calls

Since I started working with AEM CS customers, I have come across this architecture pattern a few times: requests made to a site are passed all the way through to the AEM instance (bypassing all caches), and then AEM makes an outbound request to a backend system (for example a PIM system or another API service, sometimes public, sometimes via VPN), collects the result and sends back the response.

This architectural pattern is problematic in a few ways:

AEM handles requests with a threadpool, which has an upper limit on the number of requests it will handle concurrently (by default 200). That means that at any time the number of such backend requests is limited by the number of AEM instances. In AEM CS this number is variable (auto-scaling), but even in an auto-scaling world there is an upper limit.
The most important factor in the number of such requests AEM can handle per second is the latency of the backend system call. For example, if your backend system always responds in less than 100ms, your AEM instance can handle up to 2000 such proxy requests per second. If the latency is closer to 1 second, it’s only up to 200 proxy requests per second. That can be enough, or it can be far too little.
To achieve such a throughput consistently, you need to have aggressive timeouts; if you configure your timeouts with 2 seconds, your guaranteed throughput can only be up to 100 proxy requests per second.
And next to all those proxy requests your AEM instances also need to handle the other duties of AEM, most importantly rendering pages and delivering assets. That will reduce the number of threads you can utilize for such backend calls.
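
To make the arithmetic behind these numbers explicit, here is a minimal sketch in Python. It assumes that every proxy request blocks exactly one thread for the full backend latency and that all 200 threads are available for proxying (which, as noted above, they are not in practice). The function name is made up for this illustration.

```python
def max_proxy_requests_per_second(threads: int, latency_ms: int) -> float:
    """Upper bound on the throughput of one AEM instance.

    Assumption: each request occupies one thread for latency_ms milliseconds,
    so every thread can complete 1000 / latency_ms requests per second.
    """
    if threads <= 0 or latency_ms <= 0:
        raise ValueError("threads and latency_ms must be positive")
    return threads * 1000 / latency_ms


# The three cases from the text, with the default of 200 threads.
for latency_ms in (100, 1000, 2000):
    rate = max_proxy_requests_per_second(200, latency_ms)
    print(f"{latency_ms} ms latency -> up to {rate:.0f} requests/s")
```

The same function also shows how quickly the budget shrinks: if only half of the threads are free for proxy calls, every number above is cut in half.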

The most common issue I have seen with this pattern is that in case of backend performance problems the threadpools of all AEM instances are exhausted within seconds, leading almost immediately to an outage of the AEM service. That means that a problem on the backend, or on the connection between AEM and the backend, takes down your page rendering abilities, leaving you with only what is cached at the CDN level.

The common recommendation we make in these cases is quite obvious: introduce more aggressive timeouts. But the actual solution to this problem is a different one:

Do not use AEM as a proxy.

This is a perfect example of a case where the client (browser) itself can do the integration. Instead of proxying (= tunneling) all backend traffic through AEM, the client can approach the backend service directly. Then the constraints AEM has (for example the number of concurrent requests) no longer apply to the calls to the backend. Instead the backend is exposed directly to the endusers, using whatever technology is suitable for that; typically it is exposed via an API gateway.

If the backend gets slow, AEM is not affected. If AEM has issues, the backend is not directly impacted because of it. AEM does not even need to know that there is a backend at all. Both systems are entirely decoupled.

As you can see, I much prefer this approach of “integration at the frontend layer” and exposing the backend to the endusers over any type of “AEM calls the backend systems”, mostly because such architectures are less complex and easier to debug and analyze. It should be your default and preferred approach whenever such an integration is required.

Disclaimer: Yes, there are cases where the application logic requires AEM to do backend calls; but in these cases it’s questionable whether such calls need to be done synchronously, meaning that an AEM request needs to do a backend call and wait for its result. If these requests can be done asynchronously, then the whole problem vector I outlined above simply does not exist.

Note: In my opinion hiding the hostnames of your backend system is not a good reason for such a backend integration. “The service is only available from within our company network and AEM accesses it via VPN” is not a good reason either. In both cases you can achieve the same with a publicly accessible API gateway, which is specifically designed to handle such use cases and all their security-relevant implications.

So, do not use AEM as a simple proxy!

My view on manual cache flushing

I read the following statement by Samuel Fawaz on LinkedIn regarding the recent announcement of the self-service feature to get the API key for CDN purge for AEM as a Cloud Service:

[…] Sometimes the CDN cache is just messed up and you want to clean out everything. Now you can.

I fully agree, that a self-service for this feature was overdue. But I always wonder why an explicit cache flush (both for CDN and dispatcher) is necessary at all.

The caching rules are very simple, as the rules for the AEM as a Cloud Service CDN are all based on the TTL (time-to-live) information sent from AEM or the dispatcher configuration. The caching rules for the dispatcher are equally simple and should be well understood (I find that this blog post on the TechRevel blog covers this topic of dispatcher cache flushing quite well).

In my opinion it should be doable to build a model which allows you to reason about how long it takes for a page update to become visible to all users on the CDN. Such a model also allows you to reason about more complex situations (especially when content is pulled from multiple pages/areas to render) and to understand how and when content changes become visible to endusers.
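
Such a model can start very small. The following Python sketch is only an illustration under simplifying assumptions of mine: the dispatcher cache is invalidated as soon as the replication has arrived on publish, the CDN expires purely by TTL, and browser caching is ignored. The function name and the example numbers are hypothetical.

```python
def worst_case_visibility_seconds(replication_delay_s: float, cdn_ttl_s: float) -> float:
    """Longest time until every user sees an activated change.

    Worst case: the CDN fetched the old version right before the dispatcher
    was invalidated, so it keeps serving that copy for one full TTL.
    """
    if replication_delay_s < 0 or cdn_ttl_s < 0:
        raise ValueError("delays must not be negative")
    return replication_delay_s + cdn_ttl_s


# Hypothetical numbers: replication takes 30 seconds, the CDN TTL is 5 minutes.
print(worst_case_visibility_seconds(30, 300))  # 330
```

Pages which pull in content from other pages need an extra term in such a model, because the change is only visible once the aggregating page itself has been invalidated and re-rendered.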

But when I look at the customer requests coming in for cache flushes (CDN and dispatcher), I think that in most cases there is no clear understanding of what actually happened; most often it’s just that on authoring the content is as expected and activated properly, but this change does not show up the same way on publish. The solution is often to request a cache flush (or trigger it yourself) and hope for the best. And very often this fixes the problem, and then the most up-to-date content is delivered.

But is there an understanding why the caches were not updated properly? Honestly, I doubt that very often. In the same way as the infamous “Windows restart” can fix annoying, suddenly appearing problems with your computer, flushing caches seems to be one of the first steps for fixing content problems. The issue goes away, we shrug and go on with our work.

But unlike in the case of Windows the situation is different here, because you have the dispatcher configuration in your git repository. And you know the rules of caching. You have everything you need to have to understand the problem better and even fix it from happening again.

Whenever the authoring users come to you with the request “content is not showing up, please flush the cache”, you should consider this situation as a bug. Because it is a bug: the system is not working as expected. You should apply the workaround (do the flush), but afterwards invest time into a root-cause analysis (RCA) of why it happened. Understand and adjust the caching rules. Because very often these cases are well reproducible.

In his LinkedIn post Samuel writes “Sometimes the CDN cache is just messed up”, and I think that is not true. It’s not a random event you cannot influence at all. On the contrary, it’s an event which is defined by your caching configuration. It’s an event which you can control and prevent, you just need to understand how. And I think that this step of understanding and then fixing it is missing very often. And then the next request from your authoring users for a cache flush is inevitable, and another cache flush is executed.

In the end flushing caches comes with the price of increased latency for endusers until the cache is populated again. And that’s a situation we should avoid as best we can.

So as a conclusion:

An explicitly requested cache clear is a bug because it means that something is not working as expected.
And as every bug it should be understood and fixed, so you are no longer required to perform the workaround.

Adopting AEM as a Cloud Service: Shifting from Code-Centric Approaches

The first CQ5 version I worked with was CQ 5.2.0 in late 2009; and since then a lot changed. I could list a lot of technical changes and details, but that’s not the most interesting part. I want to propose this hypothesis as the most important change:

CQ5 was a framework which you had to customize to get value out of it. Starting with AEM 6.x more and more out-of-the-box features were added which can be used directly. In AEM as a Cloud Service most new features are directly usable, not requiring (or even allowing) customization.

And as a corollary: the older your code base, the more customizations it contains, and the harder the adoption of new features becomes.

As an SRE in AEM as a Cloud Service I work with many customers who migrated their application over from an AEM 6.x version. While the “best practice analyzer” is a great help to get your application ported to AEM CS, it’s just this: it helps you to migrate your customizations, the (sometimes) vast amount of overlays for the authoring UI, backend integrations, complex business and rendering logic, JSPs, et cetera. And very often this code is based on the AEM framework only and could technically still run on CQ 5.6.1, because it works with Nodes, Resources, Assets and Pages as the only building blocks.

While this was the most straight-forward way in the times of CQ5, it becomes more and more a problem in later versions. With the introduction of Content Fragments, Experience Fragments, Core Components, Universal Editor, Edge Delivery Services and others, many new features were added which often do not fit into the self-grown application structures. These product features are promoted and demoed, and it’s understandable that the business users want to use them. But the adoption of these new features would often require large refactorings, proper planning and a budget for it. Nothing you do in a single 2-week sprint.

But this situation also has an impact on the developers themselves. While customizations through code were the standard procedure in CQ5, there are often other ways available in AEM CS. But when I read through the AEM forums and new blog posts for AEM, I still see a large focus on coding: custom servlets, Sling models, filters, whatever. Often using the same old CQ5 style we had to use 10 years ago, because there was nothing else. That approach still works, but it will lead you into the customization hell again. Also, much of it violates the practices recommended for AEM CS.

That means:

If you want to start an AEM CS project in 2024, please don’t follow the same old approach.
Make sure that you understand the new features introduced in the last 10 years, and how you can mix and match them to implement the requirements.
Opening the IDE and starting to code should be your last resort.

It also makes sense to talk with Adobe about the requirements you need to implement; I see that features requested by many customers are often prioritized and are implemented with customer involvement; a way which is much easier to do in AEM CS than before.

AEM CS & Mongo exceptions

If you are an avid log checker on your AEM CS environments you might have come across messages like this in your authoring logs:

02.04.2024 13:37:42:1234 INFO [cluster-ClusterId{value='6628de4fc6c9efa', description='MongoConnection for Oak DocumentMK'}-cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net:27017] org.mongodb.driver.cluster Exception in monitor thread while connecting to server cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net:27017 com.mongodb.MongoSocketException: cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net 
at com.mongodb.ServerAddress.getSocketAddresses(ServerAddress.java:211) [org.mongodb.mongo-java-driver:3.12.7] 
at com.mongodb.internal.connection.SocketStream.initializeSocket(SocketStream.java:75) [org.mongodb.mongo-java-driver:3.12.7] 
...
Caused by: java.net.UnknownHostException: cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net

And you might wonder what is going on. I get this question every now and then, often with the assumption that this is something problematic, because we have all learned that stacktraces normally indicate problems. And at first sight this does indicate a problem: a specific hostname cannot be resolved. Is there a DNS problem in AEM CS?

Actually this message does not indicate any problem. The reason behind it is the way MongoDB implements scaling operations. If you up- or downscale the mongo cluster, this does not happen in-place; instead you get a new mongo cluster of the new size with, of course, the same content. And this new cluster comes with a new hostname.

So in this situation there was a scaling operation, and AEM CS connected to the new cluster and now loses the connection to the old cluster, because the old cluster is stopped and its DNS entry is removed. Which is of course expected. And for that reason you can also see that this is logged on level INFO, and not as an ERROR.

Unfortunately this is a log message created by the mongo-driver itself, so this cannot be changed on the Oak level by removing the stacktrace from this message and changing the message itself. And for that reason you will continue to see it in the AEM CS logs, until a new improved mongo driver changes that.

Performance test modelling (part 5)

This is part 5 and the final post of the blog post series about performance test modelling; see part 1 for an overview and the links to all articles of this series.

In the previous post I discussed the impact of the system which we test, how the modelling of the test and the test content will influence the result of the performance test, and how you implement the most basic scenario of the performance tests.

In this blog post I want to discuss the predicted result of a performance test and the actual outcome of it, and what you can do when these do not match (actually they rarely do on the first execution). Also I want to discuss the situation where after golive you encounter that a performance test delivered the expected results, but did not match the observed behavior in production.

Scenario 1: The performance test does not match the expected results

In my experience every performance test, no matter how good or bad the basic definition is, contains at least 2 relevant data points:

the number of concurrent users (we discussed that already in part 1)
and an expected result, for example that the transaction must be completed within N seconds.

What if you don’t meet the performance criteria in point 2? This is typically the time when customers in AEM as a Cloud Service start to raise questions to Adobe, about number of pods, hardware details etc, as if the problem can only be the hardware sizing on the backend. If you don’t have a clear understanding about all the implications and details of your performance tests, this often seems to be the most natural thing to ask.

But if you have built a good model for your performance test, your first task should be to compare the assumptions with the results. Do you have your expected cache-hit ratio on the CDN? Were some assumptions in the model overly optimistic or pessimistic? As you have actual data to validate your assumptions you should do exactly that: go through your list of assumptions and check each one of them. Refine them. And when you have done that, modify the test and start another execution.
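
Going through the list of assumptions can be done quite mechanically. Here is a minimal Python sketch of that comparison; the assumption names, the numbers and the 10% tolerance are all invented for this example and not taken from any real test.

```python
def deviating_assumptions(expected: dict, measured: dict, tolerance: float = 0.1) -> dict:
    """Return all assumptions whose measured value is missing or differs
    from the expected value by more than the given relative tolerance."""
    deviations = {}
    for name, expected_value in expected.items():
        actual = measured.get(name)
        if actual is None or abs(actual - expected_value) > tolerance * abs(expected_value):
            deviations[name] = (expected_value, actual)
    return deviations


# Hypothetical model assumptions and the values observed during the test run.
expected = {"cdn_cache_hit_ratio": 0.90, "p95_cache_miss_latency_ms": 800}
measured = {"cdn_cache_hit_ratio": 0.62, "p95_cache_miss_latency_ms": 780}

for name, (exp, act) in deviating_assumptions(expected, measured).items():
    print(f"{name}: expected {exp}, measured {act}")
```

In this made-up run only the cache-hit ratio deviates, so that is the assumption to refine before the next execution.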

And at some point you might come to the conclusion, that all assumptions are correct, you have the expected cache-hit ratio, but the latency of the cache misses is too high (in which case the required action is performance tuning of individual requests). Or that you have already reduced the cache MISSES (and cache PASSES) to the minimum possible and that the backend is still not able to handle the load (in which case the expected outcome should be an upscale); or it can also be both.

That’s fine, and then it’s perfect to talk to Adobe, and share your test model, execution plan and the results. I wrote in part 1:

As you can imagine, if I am given just a few diagrams with test results and test statistics as preparation for this call with the customer … this is not enough, and very often more documentation about the test is not available. Which often leads to a lot of discussions about some very basic things and that adds even more delay to an already late project and/or bad customer experience.

But in this situation, when you have a good test model and have done your homework already, it’s possible to directly have a meaningful discussion without the need to uncover all the hidden assumptions. Also, if you have that model at hand, I assume that performance tests are not an afterthought, and that there are still reasonable options to make some changes, which will either completely fix the situation or at least remediate the worst symptoms, without impacting the go-live and the go-live date too much.

So while this is definitely not the outcome we all work, design, build and ultimately hope for, it’s still much better than the 2nd option below.

I hope that I don’t need to talk about unrealistic expectations in your performance tests, for example delivering a p99.9 with 200 ms latency, while at the same time requiring a good number of requests to always be handled by the AEM backend. You should have detected these unrealistic assumptions much earlier, mostly during design and then in the first runs during the evolution phase of your test.

Scenario 2: After go-live the performance is not what it’s supposed to be

In this scenario a performance test was either not done at all (don’t blame me for it!) or the test passed, but the results of the performance tests did not match the observed reality. This often shows up as outages in production or unbearable performance for users. This is the worst case scenario, because everyone assumed the contrary as the performance test results were green. Neither the business nor the developer team are prepared for it, and there is no time for any mitigation. This normally leads to an escalated situation with conference calls, involvement from Adobe, and in general a lot of stress for all parties.

The entire focus is on mitigation, and we (I am speaking now as a member of the Adobe team, who is often involved in such situations) will try to do everything to mitigate that situation by implementing workarounds. As in many cases the most visible bottleneck is on the backend side, upscaling the backend is indeed the first task. And often this helps to buy you some time to perform other changes. But there are even cases where an upscale of 1000% would be required to somehow mitigate that situation (which is possible, but also very short-lived, as every traffic spike on top will require an additional 500% …); also it’s impossible to reduce the latency of a single-threaded request of 20 seconds by adding more CPU. These cases are not easy to solve, and the workaround often takes quite some time and is often very tailored; and there are cases where a workaround is not even possible. Either way it’s normally not a nice experience for any of the involved parties.

I refer to all of these actions as “workarounds”. In bold. Because they are not the solution to the challenge of performance problems. They cannot be a solution because this situation proves that the performance test was testing some scenarios, but not the scenario which shows up in the production environment. It also raises valid concerns about the reliability of other aspects of the performance tests, and especially about the underlying assumptions. Anyway, we are all trying to do our best to get the system back on track.

As soon as the workarounds are in place and the situation is somehow mitigated, 2 types of questions will come up:

What does a long-term solution look like?
Why did that happen? What was wrong with the performance test and the test results?

While the response to (1) is very specific (and definitely out of scope of this blog post), the response to (2) is interesting. If you have a well-documented performance test model you can compare its assumptions with the situation in which the production performance problem happened. You have the chance to spot the incorrect or missing assumption, adjust your model and then the performance test itself. And with that you should be able to reproduce your production issue in a performance test!

And if you have a failing performance test, it’s much easier to fix the system and your application, and apply some specific changes which fix this failed test. And it gives you much more confidence that you changed the right things to make the production environment handle the same situation in a much better way next time. Interestingly, this also gives, to a large extent, the response to question (1).

If you don’t have such a model in this situation, you are badly off. Because then you either start building the performance test model and the performance test from scratch (which takes quite some time), or you switch to the “let’s test our improvements in production” mode. Most often the production testing approach is used (along with some basic testing on stage to avoid making the situation worse), but even that takes time and a high number of production deployments. While you can say it’s agile, others might say it’s chaos and hoping for the best… the exact opposite of good engineering practice.

Summary

In summary, when you have a performance test model, you are likely to have fewer problems when your system goes live. Mostly because you have invested time and thought in that topic. And because you acted on it. It will not prevent you from making mistakes, forgetting relevant aspects and such, but if that happens you have a good basis to quickly understand the problems and also a good foundation to solve them.

I hope that you learned in these posts some aspects about performance tests which will help you to improve your test approach and test design, so you ultimately have fewer unexpected problems with performance. And if you have fewer problems with that, my life in the AEM CS engineering team is much easier 🙂

Thanks for staying with me throughout this first planned series of blog posts. It’s a bit experimental, and the structure required by this topic led to some interesting additions to the overall outline (the first outline just covered 3 posts, now we are at 5). But I think that even that is not enough; some aspects deserve a blog post of their own.

Performance test modelling (part 4)

This is the 4th post of the blog post series about performance test modelling; see part 1 for an overview and the links to all articles of this series.

In parts 2 and 3 I outlined relevant aspects when it comes to modelling your performance tests:

The modelling of the expected load, often expressed as “concurrent users”.
The realistic modelling of the system where we want to conduct the performance tests, mostly regarding the relevant content and data.

In this blog post I want to show how you deduce from that data which specific scenarios you should cover with performance tests. Because there is no single test which tells you whether the resulting end-user performance is good or not.

The basic performance test scenario

Let’s start with a very simple model, where we assume that the traffic rate is more or less constant for the whole day; and therefore the performance test resembles that model:

At first sight this is quite simple to model, because your performance test will execute requests at a constant rate for the whole period of time.

But as I outlined in part 3, even if it seems that simple, you have to include at least some background noise. You also have to take into account that the cache-hit ratio is poor at the beginning, so you have to implement a cache-warmup phase (normally implemented as a ramp-up phase, in which the load increases up to the planned plateau) and only start to measure once the plateau is reached.

So our revised plan rather looks like this:

Such a test execution (with the proper modelling of users, requests and requested data) can give you pretty good results if your model assumes a pretty constant load.
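
To illustrate the shape of such a plan, here is a small Python sketch of a load profile with a linear ramp-up followed by a plateau. It is not tied to any load testing tool; the function names, the plateau of 100 requests per second and the linear ramp are assumptions for this example.

```python
def target_load(minute: float, ramp_up_minutes: float, plateau_rps: float) -> float:
    """Target request rate at a given minute of the test:
    a linear ramp-up (cache warmup), then a constant plateau."""
    if minute <= 0:
        return 0.0
    if minute >= ramp_up_minutes:
        return plateau_rps
    return plateau_rps * minute / ramp_up_minutes


def is_measured(minute: float, ramp_up_minutes: float) -> bool:
    """Only results from the plateau phase count; the warmup is ignored."""
    return minute >= ramp_up_minutes


# Hypothetical test: 30 minutes of warmup, then a plateau of 100 requests/s.
for minute in (0, 15, 30, 120):
    rate = target_load(minute, 30, 100)
    print(f"minute {minute}: {rate:.0f} req/s, measured: {is_measured(minute, 30)}")
```

A more fluctuating daily pattern can be modelled the same way by replacing the linear ramp with a function that follows the observed traffic curve.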

What if your model requires a much more fluctuating request rate (for example if your users/visitors are primarily located in North America, and during the night you have almost no traffic, but it starts to increase heavily in the American morning hours)? In that case you probably model the warmup in a way that it resembles the morning increase in traffic, both in frequency and rate. That shouldn’t be that hard, but it requires a bit more explicit modelling than just a simple ramp-up.

To give you some practical hints towards some basic parameters:

Such a performance test should run at least 2-3 hours, and even if you see that the results are not what you expect, not terminating it can reveal interesting results.
The warmup phase should cover at least 30 minutes; not only to give the caches time to warm up, but also to give the backend systems time to scale to their “production sizing”; when you don’t execute performance tests all the time, the system might scale down because there is no sense in having many systems idling when there is no load.
It can make sense to start not with the 100% of the targeted load, but with smaller numbers and start to increase from there. Because only then you can see the bottleneck which your test hits first. If you start already with 100% you might just see a lot of blockings, but you don’t know which one is the most impeding one.
When you are implementing a performance test in the context of AEM as a Cloud Service, I recommend to also use my checklist for Performance testing on AEM CS which gives some more practical hints how to get your tests right; although a few aspects covered there are covered in more depth in this post series as well.

When you have such a test passing, the biggest part of the work is done; and based on your models you can execute a number of different tests to answer more questions.

Variations of the basic performance test

The above model just covers a totally average day. But of course it’s possible to vary the created scenario to answer some more questions:

What happens if the load of the day is not 100%, but for some reasons 120%, with identical assumptions about user behavior and traffic distribution? That’s quite simple, because you just increase a number in the performance test.
The basic performance test runs just for a few hours and then stops. It gives you the confidence that the system can operate for at least that many hours, but a few issues might go unnoticed. For example memory leaks accumulating over time might only become visible after many hours of load. For that reason it makes sense to run your test for 24-48 hours continuously to validate that there is no degradation over that time.
What’s the behavior when the system goes into overload? An interesting question (but only if it does not break already when hitting the anticipated load) which is normally answered by a break test; then you increase the load more and more, until the situation really gets out of hand. If you have enough time, that’s indeed something you can try, but let’s hope that’s not very relevant 🙂
How does the system behave when your backend systems are not available? What if they come online again?

And probably many more interesting scenarios which you can think of. But you should only perform these when you have the basic version of the test right.

When you have your performance tests passing, the question is still: How does it compare to production load? Are we actually testing the right things?

In part 5 (the last post of this series) I cover the options you have when the performance test does not match the expected results, and also the worst case scenario: What happens if you find out after golive that your performance tests were good, but the production environment behaves very differently?

CDN and dispatcher – 2 complementary caching layers

I sometimes hear the question how to implement cache invalidation for the CDN. Or the question is why AEM CS still operates with a dispatcher layer when it now has a more powerful CDN in front of it.

The questions are very different, but the answer is the same in both cases: the CDN is no replacement for the dispatcher, and the dispatcher does not replace the CDN. They serve different purposes, and the combination of these two can be a really good package. Let me explain this.

The dispatcher is a very traditional cache. It’s fronting the AEM systems and the cache status is actively maintained by cache invalidation so it always delivers current data. But from an end-user perspective this cache is often far away in terms of network latency. If my AEM systems are hosted in Europe, and end-users from Australia are reaching it, the latency can get huge.

The CDN is the opposite: it serves the content from many locations across the world, being as close to the end-user as possible. But CDN cache invalidation is cumbersome, and for that reason TTL-based expiration is used most often. That means you have to accept that there is a chance that new content is already available, but the CDN still delivers old content.

Not everyone is happy with that; and if that’s a real concern, short TTLs (in the range of a few minutes) are the norm. That means that many files on the CDN will get stale every few minutes, which results in cache misses; and a cache miss on the CDN goes back to origin. But of course the reality is that not many pages change every 10 minutes; actually very few. But customers want to have that low TTL just in case a page was changed, and that change needs to become visible to all endusers as soon as possible.

So you have a lot of cache misses on the CDN, which trigger a re-fetch of the file from origin, and because many of the files have not changed, you re-fetch exactly the same binary which got stale seconds ago. That is actually a waste of resources, because your origin system delivers the same content over and over again to the CDN as a consequence of these misses. So you could keep your AEM instances busy all the time, re-rendering the same requests over and over, always creating the same response.

Enter the dispatcher cache, fronting the actual AEM instance. If the file has not changed, the dispatcher will deliver the same file (or just HTTP 304 Not Modified, which even avoids sending the content again). And it’s fast, much faster than letting AEM render the same content again. And if the file has actually changed, it’s rendered once and then reused for all future CDN cache misses.
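
A rough back-of-the-envelope calculation shows the size of this effect. The Python sketch below makes strong simplifying assumptions which are mine and not measured values: a single CDN cache location, every file is requested at least once per TTL period, a dispatcher invalidation only affects the changed files, and each changed file is rendered exactly once. All names and numbers are hypothetical.

```python
def cdn_misses_per_hour(files: int, cdn_ttl_minutes: float) -> float:
    """Each file expires once per TTL period and is then fetched from origin again."""
    return files * 60 / cdn_ttl_minutes


def aem_renderings_per_hour(files: int, cdn_ttl_minutes: float,
                            changed_files_per_hour: int, with_dispatcher: bool) -> float:
    """Without a dispatcher cache every CDN miss is rendered by AEM;
    with it, AEM only renders the files which actually changed."""
    misses = cdn_misses_per_hour(files, cdn_ttl_minutes)
    if not with_dispatcher:
        return misses
    return min(misses, changed_files_per_hour)


# Hypothetical site: 10,000 files, a CDN TTL of 10 minutes, 20 changed files per hour.
print(aem_renderings_per_hour(10_000, 10, 20, with_dispatcher=False))  # 60000.0
print(aem_renderings_per_hour(10_000, 10, 20, with_dispatcher=True))   # 20
```

Real numbers will differ (more CDN locations increase the misses, broader invalidation increases the renderings), but the order of magnitude of the difference is the point here.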

The combination of these 2 caching approaches helps you to deliver content from the edge while at the same time having a reasonable latency for content updates (that means the time between replicating a change to the publish instances and all users across the world being able to see it), without the need to have a huge number of AEM instances in the background.

So as a conclusion, using the CDN and the dispatcher cache is a good combination, if set up properly.

Performance tests modelling (part 3)

This is post 3 in my series about Performance Test Modelling. See the first post for an overview of this topic.

In the previous 2 posts I discussed the importance of having a clearly defined model of the performance tests, and that a good definition of the load factors (typically measured by “concurrent users”) is required to build a realistic test.

In this post I cover the influence of the test system and test data on the performance test and its result, and why you should spend effort to create a test with a realistic set of data/content. In this post we will do a few thought experiments, and to judge the results of each experiment, we will use the cache-hit ratio of a CDN as a proxy metric.

Let’s design a performance test for a very simple site: It just consists of 1 page, 5 images and 1 CSS and 1 JS file; 8 files in total. Plus there is a CDN for it. So let’s assume that we have to test with 100, 500 and 1000 concurrent users. What’s the test result you expect?

Well, easy. You will get the same test result for all tests irrespective of the level of concurrency; mostly because after the first requests all files will be delivered from the CDN. That means no matter with what concurrency we test, the files are delivered from the CDN, which we assume will always deliver very fast. We do not test our system, but rather the CDN, because the cache-hit ratio is quite close to 100%.

So what’s the reason why we do this test at all, knowing that the tests just validate the performance promises of the CDN vendor? There is no reason for it. The only reason why we would ever execute such a test is that during test design we did not pay attention to the data which we use to test, and someone decided that these 8 files are enough to satisfy the constraints of the performance test. But the results do not tell us anything about the performance of the site, which in production will consist of tens of thousands of distinct files.

So let’s do a second thought experiment: this time we test with 100’000 files, 100 concurrent users requesting these files randomly, and a CDN which is configured to cache files for 8 hours (TTL=8h). With regard to the cache-hit ratio, what is the expectation?

We expect that the cache-hit ratio stays low for quite some time; this is the cache-warming phase. Then it starts to increase, but it will never hit 100%, as after some time cache entries will expire and start to produce cache misses. This is a much better model of reality, but it still has a major flaw: In reality, requests are not randomly distributed, but normally there are hotspots.

A hotspot consists of files which are requested much more often than average. Normally these are homepages or other landing pages, plus other pages which users are normally directed to. This set of files is normally quite small compared to the total number of files (in the range of 1-2%), but they make up 40-60% of the overall requests. As a rule of thumb you can assume a Pareto distribution (the famous 80/20 rule), in which 20% of the files are responsible for 80% of the requests. That means we have a hotspot and a long-tail distribution of the requests.

If we modify the same performance test to take that distribution into account, we end up with a higher cache-hit ratio, because now the hotspot can be delivered mostly from the CDN. But on the long-tail we will have more cache misses, because these files are requested so rarely that they can expire on the CDN without being requested again. But in total the cache-hit ratio will be better than with the random distribution, especially on the often-requested pages (which are normally the ones we care about most).
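To make the comparison between the random and the hotspot distribution a bit more tangible, here is a minimal simulation sketch. It is not a model of any real CDN: the function name `simulate_hit_ratio`, the scaled-down numbers (10,000 files, 200,000 requests, a TTL of 50,000 ticks) and the simplification of one request per time unit are all assumptions made for illustration only.

```python
import random


def simulate_hit_ratio(num_files, num_requests, ttl, hot_fraction, hot_share, seed=42):
    """Return the cache-hit ratio of a simple TTL cache for a synthetic request stream.

    Assumption: exactly one request arrives per time unit. With probability
    hot_share the request targets one of the "hot" files (the first
    hot_fraction of all files), otherwise one of the long-tail files.
    """
    rng = random.Random(seed)
    num_hot = max(1, int(num_files * hot_fraction))
    expires_at = {}  # file id -> time at which the cached copy expires
    hits = 0
    for now in range(num_requests):
        if rng.random() < hot_share:
            file_id = rng.randrange(num_hot)
        else:
            file_id = rng.randrange(num_hot, num_files)
        if expires_at.get(file_id, -1) > now:
            hits += 1  # still cached: served by the CDN
        else:
            expires_at[file_id] = now + ttl  # miss: fetched from origin, cached again
    return hits / num_requests


# Random distribution: 20% of the files receive 20% of the requests.
uniform = simulate_hit_ratio(10_000, 200_000, 50_000, hot_fraction=0.2, hot_share=0.2)
# Hotspot distribution: 20% of the files receive 80% of the requests.
hotspot = simulate_hit_ratio(10_000, 200_000, 50_000, hot_fraction=0.2, hot_share=0.8)
print(f"uniform: {uniform:.2%}, hotspot: {hotspot:.2%}")
```

With these assumed parameters the hotspot run should end up with the higher overall cache-hit ratio, even though the long-tail files are cached less effectively than in the random case.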

Let’s translate this into a graph which displays the response time.

This test is now quite realistic, and if we only focus on the 95th percentile (p95; that means if we take 100 requests, 95 of them are faster than this) the result would meet the criteria; but beyond that the response time gets higher, because there are a lot of cache misses.

This level of realism in the test results comes at a price: the performance test model and the test preparation and execution are now much more complicated as well.
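As a small illustration of why the p95 can look fine while the slowest requests do not, the following sketch computes percentiles with the nearest-rank method. The latency values (96 cache hits at 20 ms, 4 cache misses at 800 ms) are invented for this example and are not measurements.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds:
# 96 requests served from the CDN cache, 4 requests that went to origin.
latencies_ms = [20] * 96 + [800] * 4

p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p95, p99)  # prints "20 800"
```

In this made-up data set the p95 only sees cache hits, while the p99 is dominated by the cache misses.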

And till now we only considered users, but what happens when we add random internet noise and the search engines (the unmodelled users from the first part of this series) into the scenario? These will add more (relative) weight to the long-tail, because these requests do not necessarily follow the usual hotspots, but we have to assume a more random distribution for these.

That means that then the cache-hit ratio will be lower again, as there will be much more cache-misses now; and of course this will also increase the response time of the p95. And: it will complicate the model even further.

So let’s stop here. As I have outlined above, the most simple model is totally unrealistic, but making it more realistic makes the model more complex as well. And at some point the model is no longer helpful, because we cannot transform it into a test setup without too much effort (creating test data/content, complex rules to implement the random and hotspot-based requests, etc). That means especially in the case of the test data and test scenarios we need to find the right balance. The right balance between the investment we want to make into tests and how close it should mirror the reality.

I also tried to show you how far you can get without doing any kind of performance test. Just based on some assumptions we were able to build a basic understanding of how the system will behave, and how some changes of the parameters will affect the result. I use this technique a lot and it helps me to quickly refine models and define the next steps or the next test iteration.

In part 4 I discuss various scenarios which you should consider in your performance test model, including some practical recommendations how to include them in your test model.

Performance tests modelling (part 2)

This is the second blog post in the series about performance test modelling. You can find the overview of this series and links to all its articles in the post “Performance tests modelling (part 1)“.

In this blog post I want to cover the aspect of “concurrent users”, what it means in the context of a performance test and why it’s important to clearly understand its impact.

Concurrent users is an often-used measure to indicate the load put on a system, expressed as the number of users who are using that system at the same time. For that reason many performance tests state as a quantitative requirement: “The system should be able to handle 200 concurrent users”. While that seems to be a good definition at first sight, it leaves many questions:

What does “concurrent” mean?
And what does “user” mean?
Are “200 concurrent users” enough?
Do we always have “200 concurrent users”?

Definition of concurrent

Let’s start with the first question: What does “concurrent” really mean on a technical level? How can we measure that our test indeed does “200 concurrent users” and not just 20 or 1000?

Are there any server-side sessions which we can count and which directly give this number? And that we setup our test in a way to hit that number?
Or do we have to rely on more vague definitions like “users are considered concurrent when they do a page load less than 5 minutes apart”? And that we design our test in that way?

Actually it does not matter at all, which definition you choose. It’s just important that you explicitly define which definition you use. And what metric you choose to understand that you hit that number. This is an important definition when it comes to implementing your test.
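One way to make such a definition explicit is to write it down as code. The sketch below implements one possible reading of the "page load less than 5 minutes apart" definition: a user counts as concurrent at a given time if they performed at least one page load in the preceding 5 minutes. The function name, the data layout (timestamp in seconds, user id) and the sample data are assumptions for illustration.

```python
def concurrent_users(page_loads, at_time, window_seconds=300):
    """Count distinct users with at least one page load in the
    half-open interval (at_time - window_seconds, at_time]."""
    active = {
        user
        for timestamp, user in page_loads
        if at_time - window_seconds < timestamp <= at_time
    }
    return len(active)


# Hypothetical page loads as (timestamp in seconds, user id)
page_loads = [
    (0, "alice"),
    (120, "bob"),
    (200, "alice"),
    (450, "carol"),
    (900, "bob"),
]

print(concurrent_users(page_loads, at_time=460))  # alice and carol -> 2
print(concurrent_users(page_loads, at_time=900))  # only bob -> 1
```

A different definition (for example counting server-side sessions) would give different numbers for the same traffic, which is exactly why the chosen definition should be stated explicitly.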

And as a side-note: Many commercial tools have their own definition of concurrent, and here the exact definition does not matter either, as long as you are able to articulate it.

What is a user?

The next question is about “the user” which is modeled in the test; to simplify the test and test executions one or more “typical” user personas are created, which visit the site and perform some actions. Which is definitely helpful, but it’s just that: A simplification, because otherwise our model would explode because of the sheer complexity and variety of user behavior. Also sometimes we don’t even know what a typical “user” does on our site, because that system will be brand-new.

So this is a case, where we have a huge variance in the behavior of the users, which we should outline in our model as a risk: The model is only valid if the majority of the users are behaving more or less as we assumed.

But is this all? Do all users really perform at least 10% of the actions we assume they do?

Let’s brainstorm a bit and try to find answers for these questions:

Does the google bot behave like that? All the other bots of the search engines?
What about malware scanners which try to hit a huge list of WordPress/Drupal/… URLs on your site?
Other systems performing (random?) requests towards your site?

You could argue that this traffic has little or no business value, and for that reason we don’t test for it. It could also be assumed that this is just a small fraction of the overall user traffic, and can be ignored. But that is just an assumption, and nothing more. You just assume that it is irrelevant. But often these requests are not irrelevant, not at all.

I encountered cases where not the “normal users” were bringing down a system, but rather this non-normal type of “user”. An example for that are cases where the custom 404 handler was very slow, and for that reason the basic undocumented assumption “We don’t need to care about 404s, as they are very fast” was violated and brought down the site. All performance tests passed, but the production system failed nevertheless.

So you need to think about “user” in a very broad sense. And even if you don’t implement the constant background noise of the internet in your performance test, you should list it as a factor. If you know that a lot of this background noise will trigger an HTTP status code 404, you are more likely to check that this 404 handler is fast.

Are “200 concurrent users” enough?

One piece of information every performance test has is the number of concurrent users which the system must be able to handle. But even if we assume that “concurrent” and “users” are both well defined, is this enough?

First, what data is this number based on? Is it a number based on data derived from another system, which the new system should replace? That’s probably the best data you can get. Or when you build a new system, is it based on good marketing data (which would be okay-ish), based on assumptions of the expected usage or just numbers we would like to see (because we assume that a huge number of concurrent users means a large audience and a high business value)?

So this is probably the topic which will be discussed the most. But the number and the way that number is determined should be challenged and vetted, because it’s one of the cornerstones of the whole performance test model. It does not make sense to build a high-performance and scalable system when afterwards you find out that the business numbers were grossly overrated, and a smaller and cheaper solution would have delivered the same results.

What about time?

A more important aspect, which is often overlooked, is the timing: how many users are working on the site at any given moment? Do you need to expect the maximum number 8 hours every day or just during the peak days of the year? Do you have a more or less constant usage or only during business hours in Europe?

This heavily depends on the type of your application and the distribution of your audience. If you build an intranet site for a company only located in Europe, the usage during the night is pretty much “zero”, and it will start to increase at 06:00 in the morning (probably the Germans going to work early :-)), hitting the max usage between 09:00 and 16:00 and going to zero at 22:00 at the latest. In contrast, for a site visited world-wide by customers we can expect a higher and almost flat line; of course with variations depending on the number of people being up.

This influences your tests as well, because in both cases you don’t need to simulate spikes, that means a 500% increase of users within 5 minutes. On the other hand, if you plan for large marketing campaigns addressing millions of users, this might exactly be the situation you need to plan and test for. Not to mention if you book a slot during the Superbowl break.

Why is this important? Because you need to test only scenarios which you expect to see in production, and ignore scenarios which don’t have any value for you. For example it’s a waste of time and investment to test for a sudden spike in the above-mentioned intranet case for the European company, while it’s essential for marketing campaigns to test a scenario where such a spike comes on top of the normal traffic.
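If a spike is part of your model, it helps to describe it as an explicit load profile which the test tool can follow. The sketch below is one possible way to express "a 500% increase of users within 5 minutes" on top of normal traffic; the function name, the baseline of 200 users and the spike starting at minute 30 are hypothetical values, not recommendations.

```python
def target_users(minute, baseline=200, spike_start=30, ramp_minutes=5, increase_percent=500):
    """Return the number of simulated users the test should run at a given minute.

    Before spike_start the load stays at the baseline. From spike_start on,
    it ramps up linearly over ramp_minutes until it is increase_percent
    higher than the baseline (a 500% increase means 6x the baseline).
    """
    if minute < spike_start:
        return baseline
    progress = min(1.0, (minute - spike_start) / ramp_minutes)
    return round(baseline * (1 + increase_percent / 100 * progress))


# Print the profile around the spike
for minute in (0, 29, 30, 35, 60):
    print(minute, target_users(minute))
```

For the intranet case described above you would simply not model a spike at all and only vary the baseline over the day.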

Summary

“N concurrent users” itself is not much information; and while it can serve as input, your performance test model should contain a more detailed understanding of that definition and what it means to the performance test. Otherwise you will focus just on a given number of users of this idealistic type and ignore every other scenario and case.

In part 3 I cover how the system and the test data itself will influence the result of the performance test.

Performance tests modelling (part 1)

In my last blog post about performance tests I outlined best practices for building and executing a performance test with AEM as a Cloud Service. But intentionally I left out a huge aspect of the topic:

What should your test look like?
What is a realistic test?
And what can a test result tell you about the behavior of your production environment?

These are hard questions, and I often find that they are not asked. Or people are not aware that these questions should be asked.

This is the first post in a series of blog posts, in which I want to dive a bit deeper into performance testing in the context of AEM and AEM CS (and many aspects can probably be generalized to other web applications as well). Unlike my other blog posts it addresses topics on a higher level (I will not refer to any AEM functionality or API, and won’t even mention AEM that often), because I learned over time that very often performance tests are done based on a lot of assumptions. And that it’s very hard to discuss the details of a performance test if these assumptions are not documented explicitly. I had such discussions in these 2 contexts:

The result of a performance test (in AEM as a Cloud Service) is poor and the customer wants to understand what Adobe will do.
After golive severe performance problems show up on production; and the customer wants to understand how this can happen as their tests showed no problems.

As you can imagine, if I am given just a few diagrams with test results and test statistics as preparation for this call with the customer … this is not enough, and very often more documentation about the tests is not available. Which often leads to a lot of discussions about some very basic things and that adds even more delay to an already late project and/or bad customer experience. So you can also consider this blog series as a kind of self-defense. If you were asked to read this post, now you know 🙂

I hope that this series will also help you improve your way of doing performance tests, so we all will have less of these situations to deal with.

This post series consists of these individual posts:

Part 1 (this post): What is a performance test? Why do we have these tests at all?
Part 2: What is the relevance of “number of concurrent users”? What does it even mean?
Part 3: What is the impact of test system and the test data on the result of the test?
Part 4: What scenarios should be covered by a performance test?
Part 5: What are your options if the test results do not match the observations in the production environment?

And a word upfront on the term “performance test”: I summarize a number of different test types under that term, which are executed with different intentions, and which come with many names: “Performance tests”, “Load tests”, “Stress tests”, “Endurance tests”, “Soak tests”, and many more. Their intention and execution differ, but in the end they can all benefit from the same questions which I want to cover in this blog series. So if you read “performance test”, all of these other tests are meant as well.

What is a performance test? And why do we do them?

A performance test is a tool to predict the future, more specifically how a certain system will behave in a more-or-less defined scenario.

And that outlines already two problems which performance tests have.

It is a prediction of the future. Unlike a science experiment it does not try to understand the present and extrapolate into the future. It does not have the same quality as “tomorrow we will have a sunrise, even if the weather is cloudy”, but rather goes in the direction of “if my 17 year old son wants to celebrate his birthday party with his friends at home, we better plan a full cleaning of the house for the day after”. That means no matter how well you know your son and his friends (or the system you are building), there is still an element of surprise and unknown in it.
The scenario which we want to simulate is somehow “defined”. In quotes, because in many cases the definitions of that scenario are pretty vague. We normally base these definitions on previous experience we have made and some best practices of the industry.

So it’s already clear from these 2 items, that this prediction is unlikely to be exact and 100% accurate. But it does not need to be accurate, it just needs to be helpful.

A performance test is helpful if it delivers better results than our gut feeling; and the industry has learned that our gut feeling is totally unreliable when it comes to the behaviour of web applications under production load. That’s why many enterprise golive procedures require a performance test, which will always deliver a more reliable result than gut feeling. But just creating and executing a performance test does not make it a helpful performance test.

So a helpful performance test is also a test which mimics reality closely enough that you don’t need to change your plans immediately after your system goes live and hits reality. Unfortunately you only know if your performance test was helpful after you went live. It shares this situation with other test approaches as well; for example 100% unit test coverage does not mean that your code does not have bugs, it just makes them less likely.

What does that mean for performance tests and their design?

First, a performance test is based on a mental model of your system and the to-be reality, which must be documented. All its assumptions and goals should be explicitly documented, because only then can a review be done. And a review helps to uncover blind spots in our own mental model of the system, its environment and the way it is used. It helps to clearly outline all known factors which influence the test execution and also its result.

Without that model, it is impossible to compare the test result with reality and try to understand which factor or aspect in the test was missing, misrepresented or not fully understood, and therefore led to a gap between test result and reality. If you don’t have a documented model, it’s possible to question everything, starting from the model to the correct test execution and the results. If you don’t have a model, the result of a performance test is just a PDF with little to no meaning.

Also you must be aware that this mental model is a massive simplification, as it is impossible to factor in all aspects of the reality, also because the reality changes every day. You will change your application, new releases of AEM as a Cloud Service will be deployed, you add more content, and so on.

Your mental model will never be complete and probably also never be up-to-date, and that will be reflected in your performance test. But if you know that, you can factor it in. For example, if you know that in 3 months’ time the amount of content will have doubled, you can decide if it’s helpful to redo the performance test with changed parameters. It’s now a “known unknown”, and no longer an “unknown unknown”. You can even decide to ignore factors if you deem them not relevant, but of course you should document that.

When you have designed and documented such a model, it is much easier to implement the test, execute the test and reason about the results. Without such a model, there are many more uncertainties in every piece of the test execution. It’s like developing software without a clear and shared understanding of what exactly you want to develop.

That’s enough for this post. As promised, this is more abstract than usual, but I hope you liked it and it helps to improve your tests. In part 2 I look into a few relevant aspects which should be covered by your model.