My top 3 reasons why page rendering is slow

Over the past years I was involved in many performance tuning activities, most of which dealt with slow page rendering on AEM publish instances. Performance tuning on the authoring side is often different and definitely much harder.

Over time I identified 3 main types of issues which make page rendering slow. Slow page rendering can be hidden by caching, but at some point the page needs to be rendered, and it often makes a difference whether this takes 800ms or 5 seconds. Okay, so let’s start.

Too many components

This is a pattern which I often see in older codebases. Pages are often assembled out of 100+ components, very often deeply nested. The personal record I have seen was 400 components, nested 10 levels deep. This normally causes problems in the authoring UI, because you need to be very careful to select the correct component and its parent or child container.

The problem for the page rendering process is the overhead of each component. This overhead consists of the actual include logic and all the component-level filters. While each inclusion and each component does not take much time, the large number of components causes the problem.

For that reason: Please, please reduce the number of components on your page. Not only the backend rendering time will benefit from it, but also the frontend performance (less JavaScript and fewer CSS rules to evaluate) and the authors’ experience.

Slow Sling models

I love Sling Models. But they can also hide a lot of performance problems (see my series about optimizing Sling Models), and thus can be a root cause of performance problems. In the context of page rendering and Sling Models backing HTL scripts, the problem is normally not the annotations (see this post), but rather complex and time-consuming logic executed when the models are instantiated, most specifically executing the same logic multiple times (as described in my earlier post “Sling Model Performance (Part 4)”).

External network connections

In this pattern a synchronous call to a different system is made during page rendering; while this request is executed, the rendering thread on the AEM side is blocked. This turns into a problem if the backend is either slow or not available. Unfortunately this is the hardest case to fix, because removing it often requires a redesign of the application. Please see also my post “Do not use AEM as a proxy for backend calls”; it contains a few recommendations on how to avoid at least some of the worst aspects, for example by using proper timeouts.
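To illustrate the point about timeouts, here is a minimal sketch using the JDK's own `java.net.http` client. The class name `BackendCall`, the timeout values and the fallback strategy are assumptions made for this example and are not taken from the referenced post; the idea is only that a blocked rendering thread should have a hard upper bound on its waiting time.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper: every backend call gets an upper bound on its waiting time. */
final class BackendCall {

    // Assumed values for illustration; tune them to the real latency of your backend.
    static final Duration CONNECT_TIMEOUT = Duration.ofMillis(500);
    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(2);

    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(REQUEST_TIMEOUT) // bounds the wait for the response
                .GET()
                .build();
    }

    /** Returns the response body, or the fallback if the backend is slow or unavailable. */
    static String fetchOrFallback(HttpClient client, String url, String fallback) {
        try {
            HttpResponse<String> response =
                    client.send(newRequest(url), HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 ? response.body() : fallback;
        } catch (IOException e) {
            return fallback; // timeouts are reported as a subclass of IOException
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }
}
```

With such a bound in place, a slow backend costs a rendering thread at most a couple of seconds instead of blocking it indefinitely; it does not remove the dependency itself.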

Thoughts on performance testing on AEM CS

Performance is an interesting topic on its own, and I already wrote a bit about it in this blog (see the overview). I have not yet written about performance testing in the context of AEM CS. It’s not that it is fundamentally different, but there are some specifics which you should be aware of.

Perform your performance tests on the Stage environment. The Stage environment is kept at the same sizing as the production environment, so it should show the same behavior as your PROD environment, provided you have the same content and your test is realistic.
Use a warmup phase. As the Stage environment is normally downscaled to the minimum (because there is no regular traffic), it can take a bit of time until it has (automatically) upscaled to the same number of instances your PROD is normally operating with. That means that your test should have a proper warmup period, during which you increase the traffic to the normal 100% level of production. This warmup phase should take at least 30 minutes.
I think that any test should take at least 60-90 minutes (including warmup); even if you see early that the result is not what you expect it to be, there is often something to learn even from such incorrect/faulty situations. I had the case that a customer was constantly terminating the test after about 20-25 minutes, claiming that something was not working server-side as they expected it to. Unfortunately the situation had not yet settled at that point, so I was not able to get any useful information from the system.
AEM CS comes with a CDN bundled to the environment, and that’s also the case for the Stage environment. But that also means that your performance test should contain all requests which you expect to be delivered from the CDN. This is important because it can show if your caching is working as intended. Only then can you assess the impact of cache misses (when files expire on the CDN) on the overall performance.
While you are at it, you can run a stage pipeline during the performance test and deploy new code. You should not see any significant change in performance during that time.
Oh yes, also do some content activations in that time. That makes your test much more realistic and can also reveal potential performance problems when updating content (e.g. because you constantly invalidate the complete dispatcher cache).
You should focus on a large content set when you do the performance test. If you only test a handful of pages/assets/files, you are mostly testing caches (at all levels).
“Campaign traffic” is rarely tested. This is traffic which has query strings attached (e.g. “utm_source”, “gclid” and such) to support traffic attribution. These parameters are ignored while rendering, but they often bypass all caching layers and hit AEM. A regular performance test only sends requests without these parameters; but if your marketing department runs a Facebook campaign, the traffic from that campaign looks much different, and then the results of your performance tests are not valid anymore.
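The campaign-traffic point can be covered in a load-test script by decorating a share of the test URLs with tracking parameters. The following is a minimal sketch; the class `CampaignUrls`, the parameter values and the "every n-th URL" strategy are assumptions for illustration, and the values are not URL-encoded because they are plain alphanumeric strings here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for a load-test script: mixes campaign-style URLs into the test plan. */
final class CampaignUrls {

    /** Appends tracking parameters the way an ad click would; the values are made up. */
    static String withCampaignParams(String url, String utmSource, String clickId) {
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "utm_source=" + utmSource + "&gclid=" + clickId;
    }

    /** Decorates every n-th URL (n >= 1) with campaign parameters, leaving the others untouched. */
    static List<String> mixIn(List<String> urls, int everyNth) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            String url = urls.get(i);
            boolean campaign = (i % everyNth) == 0;
            // A different click id per request mimics real campaign traffic.
            result.add(campaign ? withCampaignParams(url, "facebook", "click" + i) : url);
        }
        return result;
    }
}
```

Because every click id is different, each of these URLs is a distinct cache key unless the caching layers are configured to ignore the parameters, which is exactly the effect such a test is meant to reveal.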

Some words as precaution:

A performance test can look like a DoS attack, and your requests can get blocked for that reason. This can happen especially if these requests originate from a single source IP. For that reason you should distribute your load injectors and use multiple source IP addresses. In case you still get blocked, please contact support so we can adapt accordingly.
AEM CS uses an affinity cookie to indicate that requests of a user-agent are handled by a specific backend system. If you use the same affinity cookie throughout all your performance tests, you just test a single backend system; that effectively disables any load balancing and renders the performance test results unusable. Make sure that you design your performance tests with that in mind.
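One way to avoid reusing a single affinity cookie is to give every simulated visitor its own cookie jar. The sketch below does this with the JDK HTTP client; the class name `VirtualUsers` is hypothetical, and whether your load-test tool lets you plug in clients like this is an assumption.

```java
import java.net.CookieManager;
import java.net.http.HttpClient;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical load-injector setup: one HTTP client, and one cookie jar, per simulated visitor. */
final class VirtualUsers {

    static List<HttpClient> create(int count) {
        List<HttpClient> clients = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            clients.add(HttpClient.newBuilder()
                    // A fresh CookieManager starts without any affinity cookie,
                    // so each visitor can be routed independently.
                    .cookieHandler(new CookieManager())
                    .build());
        }
        return clients;
    }
}
```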

In general I much prefer helping you during the performance testing phase over handling escalations because of bad performance and the potential outages caused by it. I hope that you think the same way.

Identify repository access

Performance tuning in AEM is typically a tough job. The most obvious and widely known aspect is the tuning of JCR queries, but that’s about it; if your code does not execute any JCR query and is still slow, it gets hard. For requests my standard approach is to use “Recent requests” to identify slow components, but that’s it. And then you have thread dumps, but these hardly help here. There is no standard way to diagnose further without relying on gut feeling and luck.

When I had to optimize a request last year, I thought again about this problem. And I asked myself the question:
Whenever I check this request in the threaddumps, I see the code accessing the repository. Why is this the case? Is the repository slow or is it just accessing the repository very frequently?

The available tools cannot answer this question. So I had to write something myself which can do that. In the end I committed it to the Sling codebase with SLING-11654.

The result is an additional logger (“org.apache.sling.jcr.resource.AccessLogger.operation” on log level TRACE), which you can enable and which logs every single (Sling) repository access, including the operation, the path and the full stacktrace. That is a huge amount of data, but it answered my question quite thoroughly.

The repository itself is very fast, because a request (taking 500ms in my local setup) performs 10’000 repository accesses. So the problem is rather the total number of repository accesses.
Looking at the list of accessed resources it became very obvious that there is a huge number of redundant accesses. For example, these are the top 10 accessed paths while rendering a simple WKND page (/content/wknd/language-masters/en/adventures/beervana-portland):
- 1017 /conf/wknd/settings/wcm/templates/adventure-page-template/structure
- 263 /
- 237 /conf/wknd/settings/wcm/templates
- 237 /conf/wknd/settings/wcm
- 227 /content
- 204 /content/wknd/language-masters/en
- 199 /content/wknd
- 194 /content/wknd/language-masters/en/adventures/beervana-portland/jcr:content
- 192 /content/wknd/jcr:content
- 186 /conf/wknd/settings

But now with that logger, I was able to identify access patterns and map them to code. And suddenly you see a much bigger picture, and you can spot a lot of redundant repository access.
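A ranking like the one above can be produced with a few lines of code once the accessed paths have been extracted from the log. This is only a sketch: the class `AccessCounter` is hypothetical, and it deliberately takes a plain list of paths as input, because the exact format of the log lines is not assumed here.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical post-processing step: rank repository paths by access count. */
final class AccessCounter {

    /** Returns the n most frequently accessed paths, most frequent first. */
    static List<Map.Entry<String, Long>> topPaths(List<String> accessedPaths, int n) {
        Map<String, Long> counts = accessedPaths.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                // Highest count first; ties are ordered by path to keep the output stable.
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }
}
```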

With that help I identified the bottleneck in the code, and the posting “Sling Model performance” was the direct result of this finding. Another result was the topic for my talk at AdaptTo() 2023; check out the recording for more numbers, details and learnings.

But with these experiences I made an important observation: You can use the number of repository accesses as a proxy metric for performance. The more repository accesses you do, the slower your application will get. So you don’t need to rely so much on performance tests anymore (although they definitely have their value), but you can validate changes in the code by counting the number of repository accesses performed by it. Fewer repository accesses are always more performant, no matter the environmental conditions.

And with an additional logger (“org.apache.sling.jcr.AccessLogger.statistics” on TRACE) you can get just the raw numbers without details, so you can easily validate any improvement.

Equipped with that knowledge you should be able to investigate the performance of your application on your local machine. Looking forward to the results 🙂

(This is currently only available on AEM CS / AEM CS SDK; I will try to get it into an upcoming AEM 6.5 service pack.)

The Explain Query tool

If there’s a topic which has been challenging forever in the AEM world, it’s JCR queries and indexes. It can feel like an arcane science, where it’s quite easy to mess up and end up with a slow query. I also learned it the hard way, and a printout of the JCR query cheatsheet is always below my keyboard.

But there were some recent changes which made working on query performance easier. First, the Explain Query tool has been added to AEM CS, where it is also available via the AEM Developer Console. It displays queries, slow queries, the number of rows read, the used index, the execution plan etc. But even with that tool alone it’s still hard to understand what makes a query performant or slow.

Last week there was a larger update to the AEM documentation (thanks a lot, Tom!), which added a detailed explanation of the Explain Query tool. In particular, it drills down into the details of the query execution plan and how to interpret it.

With this information and the good examples given there you should be able to analyze the query plan of your queries and optimize the indexes and queries before you execute them the first time in production.

Sling Model Performance

In my daily job as an SRE for AEM as a Cloud Service I often have to deal with performance questions, especially in the context of migrations of customer applications. Applications sometimes perform differently on AEM CS than they did on AEM 6.x, and a part of my job is to look into these cases.

This often leads to interesting deep dives and learnings; you might have seen this reflected in the postings of this blog 🙂 The problem this time was a tight loop like this:

for (Resource child : resource.getChildren()) {
  SlingModel model = child.adaptTo(SlingModel.class);
  if (model != null && model.hasSomeCondition()) {
    // some very lightweight work
  }
}

This code performed well with 1000 child resources on an AEM 6.x authoring instance, but quite poorly on an AEM CS authoring instance with the same number of child nodes. And the problem is not the large number of child nodes …

After wading knee-deep through TRACE logs I found the problem at an unexpected location. But before I present the solution and some recommendations, let me explain some background. Of course you can skip the next section and jump directly to the TL;DR at the bottom of this article.

SlingModels and parameter injection

One of the beauties of Sling Models is that they are simple POJOs, and properties are injected by the Sling Models framework. You just have to add matching annotations to mark them accordingly. See the full story in the official documentation.

The simple example in the documentation looks like this:

@Inject String title;

which (typically) injects the property named “title” from the resource this model was adapted from. In the same way you can inject services, child nodes and many other useful things.

To make this work, the framework uses an ordered list of Injectors, which are able to retrieve values to be injected (see the list of available injectors). The first injector which returns a non-null value is taken and its result is injected. In this example the ValueMapInjector is supposed to return a property called “title” from the valueMap of the resource, which is quite early in the list of injectors.

Ok, now let’s understand what the system does here:

@Inject @Optional String doesNotExist;

Here an optional field is declared, and if there is no property called “doesNotExist” in the valueMap of the resource, the other injectors are queried whether they can handle that injection. Assuming that no injector can do that, the value of the field “doesNotExist” remains null. No problem at first sight.

But indeed there is a problem, and it’s performance. Even the lookup of a non-existing property (or node) in the JCR takes time, and doing this a few hundred or even a thousand times in a loop can slow down your code. And with a slower repository (like the clustered MongoDB persistence in the AEM as a Cloud Service authoring instances) even more so.
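The fall-through behaviour described above can be modelled in plain Java. This is a deliberately simplified, hypothetical model (the class `InjectorChain` and its methods are invented for this sketch and are not the Sling Models implementation); it only counts how many injectors get asked for a value.

```java
import java.util.List;
import java.util.function.Function;

/** Simplified, hypothetical model of an ordered injector chain; not the Sling implementation. */
final class InjectorChain {

    private final List<Function<String, Object>> injectors;
    private int lookups = 0;

    InjectorChain(List<Function<String, Object>> injectors) {
        this.injectors = injectors;
    }

    /** Generic lookup: asks every injector in order until one returns a non-null value. */
    Object injectGeneric(String name) {
        for (Function<String, Object> injector : injectors) {
            lookups++;
            Object value = injector.apply(name);
            if (value != null) {
                return value;
            }
        }
        return null; // the optional field stays null, but every injector was asked
    }

    /** Targeted lookup: asks exactly one injector, whether or not the value exists. */
    Object injectFrom(int injectorIndex, String name) {
        lookups++;
        return injectors.get(injectorIndex).apply(name);
    }

    /** Total number of injector calls made so far. */
    int lookups() {
        return lookups;
    }
}
```

In this model an existing property costs one lookup in either mode, while a missing property costs one lookup per injector in the generic mode and exactly one in the targeted mode. In a real system each of these lookups may touch the repository or the service registry, which is where the time goes.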

To demonstrate it, I wrote a small benchmark (source code on my github account), which does a lot of adaptions to Sling Models. When deployed to AEM 6.5.5 or later (or a recent version of the AEM CS SDK) you can run it via curl -u admin:admin http://localhost:4502/bin/slingmodelcompare

This is its output:

de.joerghoh.cqdump.performance.core.models.ModelWith3Injects: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith3ValueMaps: single adaption took 16 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalValueMap: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalValueMaps: single adaption took 20 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalInject: single adaption took 83 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalInjects: single adaption took 137 microseconds

It’s a benchmark which, on a very simple list of resources, tries adaptions to a number of Model classes which differ in their type of annotations. Adapting to a model which injects 3 properties takes approximately 20 microseconds, but as soon as a model has a failing injection (which is declared with “@Optional” to avoid failing the adaption), the duration increases massively to 83 microseconds, and even to 137 microseconds when there are 2 of these failed injections.

Ok, so having a few of such failed injections is not a problem per se (you could do 2’000 within 100 milliseconds), but this test setup is a bit artificial, which makes these 2’000 a really optimistic number:

It is running on a system with a fast repository (SDK on my M1 MacBook); so for example the ChildResourceInjector has almost no overhead when it tests for the presence of a child resource called “doesNotExist”. This can be different: for example, on AEM CS Author the Mongo storage has a higher latency than the segment store on the SDK or on a publish instance. If that (non-existing) child resource is not in the cache, there is an additional latency in the range of 1ms to load that information. What for? Well, basically for nothing.
The OsgiInjector is queried as well, which tries to access the OSGI ServiceRegistry; this registry is a central piece of OSGI, and its consistency is heavily guarded by locks. I have seen this injector being blocked by these locks, which also adds latency.

That means that these 50-60 microseconds could easily multiply, and then the performance becomes a problem. And this is the problem which initially sparked this investigation.

So what can we do to avoid this situation? That is quite easy: Do not use @Inject, but use the specialized injectors directly (see them in the documentation). While the benefit is probably quite small when it comes to properties which are present (ModelWith3Injects took 18 microseconds vs 16 microseconds for ModelWith3ValueMaps), the difference gets dramatic as soon as we consider failed injections:

1 failed injection: 18 microseconds (ModelWithOptionalValueMap) vs 83 microseconds (ModelWithOptionalInject)
2 failed injections: 20 microseconds (ModelWith2OptionalValueMaps) vs 137 microseconds (ModelWith2OptionalInjects)
And with every additional failed injection for that model the penalty will increase by another 50-60 microseconds.

Even in my local benchmark the improvement can be seen quite easily: there is almost no overhead for such a failed injection if I explicitly mark it as an injection via the ValueMapInjector. And as mentioned, this overhead can be even larger in reality.

Still, this is a micro-optimization in the majority of cases; but as mentioned already, many of these optimizations taken together can definitely make a difference.

TL;DR Use injector-specific annotations

Instead of @Inject use directly the correct injector. You normally know exactly where you want that injected value to come from.
And by the way: did you know that the use of @Inject is discouraged in favor of these injector-specific annotations?

Update: The Sling Models documentation has been updated and explicitly discourages the use of @Inject now.

How to deal with the “TooManyCallsException”

Every now and then I see the question “We get the TooManyCallsException while rendering pages, and we need to increase the threshold for the number of inclusions to 5000. Is this a problem? What can we do so we don’t run into this issue at all?”

Before I answer this question, I want to explain the background of this setting, why it was introduced and when such a “Call” is made.

Sling rendering is based on Servlets; and while a single servlet can handle the rendering of the complete response body, that is not that common in AEM. AEM pages normally consist of a variety of different components, which internally can consist of distinct subcomponents as well. This depends on the design approach the development team has chosen.
(It should be mentioned that all JSPs and all HTL scripts are compiled into regular Java servlets.)

That means that the rendering process can be considered a tree of servlets, with servlets calling other servlets (and the DefaultGetServlet being the root of such a tree when rendering pages). This tree is structured along the resource tree of the page, but it can include servlets which render content from different areas of the repository; for example when dealing with content fragments or when including images, which require their metadata to be respected.

It is possible to turn this tree into a cyclic graph; and that means that the process of traversing this tree of servlets will turn into an endless recursion. In that case request processing will never terminate, the Jetty thread pool will quickly fill up to its limit, and the system will become unavailable. To avoid this situation only a limited number of servlet calls per request is allowed. And that’s the magic number of 1000 allowed calls (which is configured in the Sling Main Servlet).
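The protective effect of such a limit can be shown with a small model. The class `IncludeCounter` and its exception are hypothetical and only mimic the idea of counting calls per request; this is not Sling's actual code.

```java
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of a per-request call counter; not Sling's actual code. */
final class IncludeCounter {

    static final class TooManyCalls extends RuntimeException {
        TooManyCalls(int limit) {
            super("more than " + limit + " calls in one request");
        }
    }

    private final Map<String, List<String>> includes; // component -> components it includes
    private final int maxCalls;
    private int calls = 0;

    IncludeCounter(Map<String, List<String>> includes, int maxCalls) {
        this.includes = includes;
        this.maxCalls = maxCalls;
    }

    /** Renders a component and everything it includes; returns the total number of calls. */
    int render(String component) {
        if (++calls > maxCalls) {
            throw new TooManyCalls(maxCalls); // a cycle ends here instead of running forever
        }
        for (String child : includes.getOrDefault(component, List.of())) {
            render(child);
        }
        return calls;
    }
}
```

A regular tree of includes finishes with a call count equal to the number of rendered components, while a cycle (a includes b, b includes a) is aborted as soon as the limit is exceeded.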

Knowing this, let me try to answer the question “Is it safe to increase this value of 1000 to 5000?“. Yes, it is safe. But if your page rendering process goes recursive, it terminates later, which slightly increases the risk of your AEM instance becoming unavailable.

“Are there any drawbacks? Why is the default 1000 and not 5000 (or 10000 or any higher value)?” From experience 1000 is sufficient for the majority of applications. It might be too low for applications where the components are designed in a very granular way, which in turn requires a lot of servlet calls to properly render a page.
And every servlet call comes with a small overhead (mostly for running the component-level filters); even if this overhead is just 100 microseconds, 1000 invocations add up to 100 ms just for the invocation overhead. That means you should find a good balance between a clean application modularization and its runtime performance overhead.
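The arithmetic behind this estimate is simple enough to keep around as a helper. The 100 microseconds per call are the illustrative value from the text above, not a measurement, and the class name is made up.

```java
/** Back-of-the-envelope estimate of the pure invocation overhead per request. */
final class InvocationOverhead {

    /** Total overhead in milliseconds for the given number of servlet calls. */
    static double millis(int calls, long microsPerCall) {
        return calls * microsPerCall / 1000.0;
    }
}
```

With these assumptions `InvocationOverhead.millis(1000, 100)` yields 100 ms, and a page that needs the raised limit of 5000 calls would spend 500 ms on invocation overhead alone.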

Which leads to the next question: “What are the problematic calls we should think of?“. Good one.
From a high-level view of AEM page renderings, you cannot avoid the servlet-calls which render the components. That means that you as an AEM application developer cannot influence the overall page rendering process, but you can only try to optimise the rendering of individual (custom) components.
To optimise these, you should be aware, that the following things trigger the invocation of a servlet during page rendering:

the <cq:include>, <sling:include> and <sling:forward> JSP tags
the data-sly-include statement of HTL
and every method which invokes directly or indirectly the service() method of a servlet.

A good way to check this for some pages is the “Recent requests” functionality of the OSGI Webconsole.

AEM micro-optimization (part 4) – define allowed templates

This time I want to discuss a different type of micro-optimization. It’s not something you as a developer can implement in your code, but it’s rather a question of the application design, which has some surprising impact. I came across it when I recently investigated poor performance in the Siteadmin navigation. And although I did this investigation in AEM as a Cloud Service, the logic on AEM 6.5 behaves the same way.

When you click through your pages in the siteadmin navigation, AEM collects a lot of information about pages and folders to display them in the proper context. For example, when you click on a page with child pages, it collects information on which actions should be displayed if a specific child node is selected (copy, paste, publish, …).

An important piece of information is whether the “Create page” action should be made available. And that’s the thing I want to outline in this article.

Assuming that you have the required write permissions on that folder, the most important question is whether any templates are allowed to be created as children of the current page. The logic is described in the documentation and is quite complex.

In short:

On the content the template must be allowed (using the cq:allowedTemplates property, if present) AND
The template must be allowed to be used as a child page of the current page

Both conditions must be met to make a template eligible to be used as a source for a new page. To display the entry “Page” it’s sufficient if at least 1 template is allowed.

Now let’s think about the runtime performance of this check, and that’s mostly determined by the total number of templates in the system. AEM determines all templates by this JCR query:

//jcr:content/element(*,cq:Template)

And that query returns 92 results on my local SDK instance with WKND installed. If we look a bit more closely at the results, we can determine 3 different types of templates:

Static templates
Editable templates
Content Fragment models

So depending on your use-case it’s easy to end up with hundreds of templates, and not all of them are applicable at the location you are currently in. In fact, typically just very few templates can be used to create a page here. That means that the check most likely needs to iterate a lot to eventually encounter a template which is a match.

Let’s come back to the evaluation of whether that entry should be displayed. If you have defined the cq:allowedTemplates property on the page or its ancestors, it’s sufficient to check the templates listed there. Typically it’s just a handful of templates, and it’s very likely that you find a “hit” early on, which immediately terminates this check with a positive result. I want to explicitly mention that not every template listed can be created here, because there are also other constraints (e.g. the parent template must be of a certain type etc.) which must match.

If template A is allowed to be used below /content/wknd/en, then we just need to check the single Template A to get that hit. We don’t care, where in the list of templates it is (which are returned by the above query), because we know exactly which one(s) to look at.

If that property is not present, AEM needs to go through all templates and check the conditions for each and every one until it finds that positive result. The templates are checked in the order in which they are returned from the JCR query, which means the order is not deterministic. Also it is not possible to order the result in a helpful way, because the semantics of our check (which includes regular expressions) cannot be expressed as part of the JCR query.

So you are very lucky if the JCR query returns a matching template already at position 1 of the list, but that’s very unlikely. Typically you need to iterate tens of templates to get a hit.
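The difference between the two situations can be made visible with a small model that counts how many templates have to be evaluated. The class `TemplateCheck` and the predicate are hypothetical stand-ins for AEM's real checks, which are more involved.

```java
import java.util.List;
import java.util.function.Predicate;

/** Simplified, hypothetical model of the "is any template allowed here?" check. */
final class TemplateCheck {

    /** Returns how many templates were evaluated before the first match (or all, if none matches). */
    static int checksUntilHit(List<String> candidates, Predicate<String> allowedHere) {
        int checks = 0;
        for (String template : candidates) {
            checks++;
            if (allowedHere.test(template)) {
                break; // one allowed template is enough to show the "Page" entry
            }
        }
        return checks;
    }
}
```

If the candidates come from a short cq:allowedTemplates list, the loop ends after very few checks; if they are the full, arbitrarily ordered query result, the only matching template may sit anywhere in the list, and with no match at all every single template is evaluated.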

So, what’s the impact of this iteration and the checks on performance? In a synthetic check with 200 templates and without any match, it took around 3-5ms to iterate and check all of the results.

You might say, “I really don’t feel a 3-5ms delay”, but when the list view in siteadmin performs this check for up to 40 pages in a single request, it’s rather a 120-200 millisecond difference. And that is a significant delay for requests where bad performance is visible immediately. Especially if there’s a simple way to mitigate this.

And for that reason I recommend that you provide “cq:allowedTemplates” properties on your content structure. In many cases that is possible, and it will speed up the siteadmin navigation.

And for those who cannot change that: I am currently working on changing the logic to speed up the processing for the cases where no cq:allowedTemplates property is applicable. And if you are on AEM as a Cloud Service, you’ll get this improvement automatically.

AEM micro-optimization (part 1)

As a follow-up to the previous article I want to show you what a micro-optimization can look like. My colleague Miroslav Smiljanic found that there is a significant difference in the time it takes to compute the statements (1) and (2).

Node node = …
Session session = node.getSession();
String parentPath = node.getParent().getPath();

Node p1 = node.getParent(); // (1)
Node p2 = session.getNode(parentPath); // (2)

assertEquals(p1,p2);

He did the whole writeup in the context of a suggested improvement in Sling, and proved it with impressive numbers.

Is this change important? Just by itself it is not, because going up the resource/node tree is not that common compared to going down the tree. So replacing a single call might only yield an improvement of a fraction of a millisecond, even if case (2) is up to 200 times faster than (1)!

But if we can replace getParent() with the more performant session.getNode() call in all cases where this is possible, especially in the low-level areas of AEM and Sling, all areas might benefit from it. Then the faster call is not executed only once per page rendering, but maybe a hundred times. And then we might already end up with tens of milliseconds of improvement, for any request!

And in special usecases the effect might be even higher (for example if your code is constantly traversing the tree upwards).

Another example of such a micro-optimization, which is normally quite insignificant but can yield huge benefits in special cases, can be found in SLING-10269, where I found that a built-in caching of the isResourceType() results reduces the rendering times of some special requests by 50%, because the check is done thousands of times.

Typically micro-optimizations have these properties:

In the general case the improvement is barely visible (< 1% improvement of performance)
In edge cases they can be a life saver, because they reduce execution time by a much larger percentage.

The interesting part is that these improvements accumulate over time. When you have implemented 10 of these in low-level routines, chances are high that your use case benefits from them as well. Maybe by 10 times a 0.5% performance improvement, but maybe also by 20%, because you hit the sweet spot of one of these.

So it is definitely worth paying attention to these improvements.

My recommendation for you: Read the entry in the Oak “Do’s and Don’ts” page and try to implement this learning in your codebase. And if you find more of such cases in the Sling codebase, the community appreciates a ticket.


The effect of micro-optimizations

Optimizing software for speed is a delicate topic. Often you hear the saying “Make it work, make it right, make it fast”, implying performance optimization should be the last step you should do when you code. Which is true to a very large extent.

But in many cases you are happy if your budget allows you to get to the “make it right” phase, and you rarely get the chance to kick off a decent performance optimization phase. That situation is true in many areas of the software industry, and performance optimization is often only done when absolutely necessary. Which is unfortunate, because it leaves us with a lot of software which has performance problems. And in many cases a large part of the problem could be avoided if only a few optimizations were done (at the right spot, of course).

But all this talk of a “performance improvement phase” assumes that it requires huge effort to make software more performant. In general that is true, but there are typically a number of actions which can be implemented quite easily and which are beneficial. Of course these rarely boost your overall application performance by 50%; most often they just speed up certain operations. But depending on how frequently these operations are called, this can add up to a substantial improvement.

I once did a performance tuning session on an AEM publish instance to improve the raw page rendering performance of an application. The goal was to squeeze more page responses out of the given hardware. Using a performance test and a profiler I found that the creation of JCR sessions and Sling ResourceResolvers took 1-2 milliseconds, which was worth investigating. Armed with this knowledge I combed through the codebase, reviewed all cases where a new Session is created and removed all cases where it was not necessary. This was really a micro-optimization, because I focused on tiny pieces of the code (not even the areas which are called many times), and the regular page rendering (on a developer machine) was not improving at all. But in production this optimization turned out to help a lot, because it allowed us to deliver 20% more pages per second out of the publish at peak.

In this case I spent quite some time to come to the conclusion that opening sessions can be expensive under load. But now I know that, and I spread that knowledge via code reviews and blog posts.

Most often you don’t see the negative effect of these anti-patterns (unless you overdo it and every Sling Model opens a new ResourceResolver), and therefore the positive effects of applying these micro-optimizations are not immediately visible. But in the end, applying 10 micro-optimizations with a ~1% speedup each sums up to a pretty nice number.
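How such small gains add up can be checked with a one-liner. The sketch below assumes that each optimization removes the given fraction of the remaining execution time and that the optimizations are independent of each other; both are simplifications, and the class name is made up.

```java
/** Arithmetic sketch: how several small speedups combine. */
final class CompoundSpeedup {

    /** Fraction of the original execution time saved after applying each reduction in turn. */
    static double totalReduction(int optimizations, double reductionEach) {
        return 1.0 - Math.pow(1.0 - reductionEach, optimizations);
    }
}
```

Under these assumptions `CompoundSpeedup.totalReduction(10, 0.01)` comes out at roughly 9.6%, slightly below a naive 10 × 1%, but still a noticeable improvement for changes that are individually hard to measure.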

And of course: If you can apply such a micro-optimization in a codepath which is heavily used, the effects can be even larger!

So my recommendation to you: If you come across such a piece of code, optimize it. Even if you cannot quantify and measure the immediate performance benefit of it, do it.

Same as:

for (int i = 0; i <= 100; i++) {
  othernumber += i;
}

I cannot quantify the improvement, but I know, that

othernumber += 5050;

is faster than the loop, no questions asked. (Although that’s a bad example, because hopefully the compiler would do it for me.)

In the upcoming blog posts I want to show you a few cases of such micro-optimizations in AEM, which I personally used with good success. Stay tuned.


Slow deployments on AEM 6.4/6.5

A recent post on the AEM forums challenged me to look into an issue I observed myself but did not investigate further.

The observation is that during deployments maintenance tasks are stopped and started a lot, and this triggers a lot of other activities, including a lot of healthcheck executions. This slows down the deployments and also pollutes the logfiles during deployments.

The problem is that the AEM Maintenance TaskScheduler is supposed to react to changes on some paths in the repository (where the configuration is stored), but unfortunately it also reacts to any change of ResourceProviders (and every Servlet is implemented as a single ResourceProvider). And because this causes a complete reload/restart of the maintenance tasks (and some healthchecks as well), it’s causing quite some delay.

But this behaviour is controlled via OSGI properties, which are missing by default, so we can add them on our own 🙂

Just create an OSGI configuration for com.adobe.granite.maintenance.impl.TaskScheduler and add a single multi-value property named “resource.change.types” with the values “ADDED”, “CHANGED”, “REMOVED”.

(Please also report this behavior via Daycare and refer to GRANITE-29609, so we hopefully get a fix for it instead of having to apply this workaround.)