Some years ago, even before AEM as a Cloud Service, the RepoInit language was implemented as part of Sling (and AEM) to create repository structures directly at startup of the JCR repository. With it your application can rely on certain well-defined structures always being available.
In this blog post I want to walk you through a way to test repoinit statements locally and avoid pipeline failures caused by them.
Repoinit statements are deployed as part of OSGi configurations, which means that during the development phase you can work with them in an almost interactive way. Exceptions are not a problem either; you can fix the statement and retry.
The situation is much different when you already have repoinit statements deployed and you start up your AEM (to be exact: the Sling repository service) again. In this case all repoinit statements are executed as part of the startup of the repository, and any exception during the execution of repoinit will stop the startup of the repository service and render your AEM unusable. In the case of CloudManager and AEM as a Cloud Service this will break your deployment.
Let me walk you through 2 examples of such exceptions and how you can deal with them.
*ERROR* [Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepository service registration aborted java.lang.RuntimeException: Session.save failed: javax.jcr.nodetype.ConstraintViolationException: OakConstraint0025: /conf/site/configuration/favicon.ico[[nt:file]]: Mandatory child node jcr:content not found in a new node
at org.apache.sling.jcr.repoinit.impl.AclVisitor.visitCreatePath(AclVisitor.java:167) [org.apache.sling.jcr.repoinit:1.1.36]
at org.apache.sling.repoinit.parser.operations.CreatePath.accept(CreatePath.java:71)
In this case the exception is quite detailed about what actually went wrong. It failed on save, and it says that /conf/site/configuration/favicon.ico (of type nt:file) was affected. The problem is that the mandatory child node "jcr:content" is missing.
Why is that a problem? Because every node of nodetype "nt:file" requires a "jcr:content" child node which actually holds the binary.
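The failing statement itself is not quoted in the log, but given the stacktrace (visitCreatePath) it presumably looked something like this reconstruction:

create path (nt:file) /conf/site/configuration/favicon.ico

Repoinit's "create path" can create the nt:file node, but not the mandatory jcr:content child with its binary, so the subsequent save fails. A safer approach is to let repoinit create only folder structures and ship actual files via a content package:

create path (sling:Folder) /conf/site/configuration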
This is a case which you can detect very easily on a local environment as well.
Which leads to the first recommendation:
When you develop in your local environment, you should apply all repoinit statements to a fresh environment which contains no manual changes. Otherwise your repoinit statements may rely on structures which were created manually and are not provided by the repoinit scripts.
Having a mix of manual changes and repoinit on a local development environment and then moving it over untested often leads to failures in the CloudManager pipelines.
The second example is a very prominent one, and I see it very often:
[Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepository service registration aborted java.lang.RuntimeException: Failed to set ACL (java.lang.UnsupportedOperationException: This builder is read-only.) AclLine DENY {paths=[/libs/cq/core/content/tools], privileges=[jcr:read]}
at org.apache.sling.jcr.repoinit.impl.AclVisitor.setAcl(AclVisitor.java:85)
It's the well-known "This builder is read-only" error. To understand the problem and its resolution, I need to explain a bit how the build process assembles AEM images in the CloudManager pipeline.
In AEM as a Cloud Service you have an immutable part of the repository, which consists of the trees "/libs" and "/apps". They are immutable because they cannot be modified at runtime, not even with admin permissions.
This immutable part of the image is built at build time. The process merges product-side parts (/libs) and custom application parts (/apps) together. After that all repoinit scripts run, both the ones provided by the product and any custom ones. During this part of the build these trees are still writable, so writing into /apps using repoinit is not a problem.
So why do you actually get this exception if /libs and /apps are writable at that point? Because repoinit is executed a second time: during the "final" startup, when /apps and /libs are immutable.
Repoinit is designed around the idea that all operations are idempotent. This means that if you want to create an ACL on /apps/myapp/foo/bar, the repoinit statement is a no-op if that specific ACL already exists. A second run of repoinit will do nothing, but find everything still in place.
But if the second run has to execute this action again, it is not a no-op anymore, which means that the ACL (or whatever the goal of that repoinit statement was) is no longer in place as expected.
And there is only one reason why this can happen: some other action between these two executions of repoinit changed the repository. The only other thing which modifies the repository is the installation of content packages.
Let’s illustrate this problem with an example. Imagine you have this repoinit script:
create path /apps/myapp/foo/bar
set ACL on /apps/myapp/foo/bar
  allow jcr:read for mygroup
end
And you have a content package which comes with content for /apps/myapp, whose filter is set to "overwrite", and which does not contain this ACL.
In this case the operations leading to this error are these:
During the image build, repoinit sets the ACL on /apps/myapp/foo/bar
the deployment overwrites /apps/myapp with the content package, so the ACL is wiped
AEM starts up
Repoinit wants to set the ACL on /apps/myapp/foo/bar, which is now immutable. It fails and breaks your deployment.
The solution to this problem is simple: you need to adjust the repoinit statements and the package definitions (especially the filter definitions) in such a way that the package installation does not wipe or overwrite any structure created by repoinit. And by "structure" I do not mean only nodes, but also nodetypes, properties etc. All must be identical, and in the best case they don't interfere at all.
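For illustration, a hypothetical filter.xml for the content package from the example above. Whether FileVault's "merge" import mode is appropriate depends on what else the package needs to update; treat this as a sketch, not a recipe:

<workspaceFilter version="1.0">
    <!-- "merge" only adds new nodes and leaves existing ones
         (such as the ACL created by repoinit) untouched -->
    <filter root="/apps/myapp" mode="merge"/>
</workspaceFilter>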
It is hard to validate this locally, as you don't have an immutable /apps and /libs there, but there is a test approach which comes very close to it:
Run all your repoinit statements in your local test environment, install all your content packages, and then trigger a second run of the repoinit statements (for example by restarting the instance).
During this second run of the repoinit statements you should not see any write operation in the trace log. If you do, it's a sign that your packages overwrite structures created by repoinit. You should fix these asap, because they will later break your CloudManager pipeline.
With this information at hand you should be able to troubleshoot any repoinit problems already on your local test environment, avoiding pipeline failures because of it.
Sling events are used for many aspects of the system, and initially JCR changes were delivered with them as well. But OSGi eventing (which Sling events are built on top of) is not designed for huge volumes of events (thousands per second), a situation which can happen in AEM; and one of the most compelling reasons to move away from this approach is that all these event handlers (both resource change events and all others) share a single thread pool.
For that reason the ResourceChangeListeners have been introduced. Here each listener declares in detail which changes it is interested in (restricted by path and type of the change); therefore Sling is able to optimise the listeners on the JCR level and does not listen for changes no one is interested in. This can reduce the load on the system and improve performance. For this reason the use of OSGi event handlers for resource changes is deprecated (although they still work as expected).
How can I find all the ResourceChangeEventListeners in my codebase?
That’s easy, because on startup for each of these ResourceChangeEventListeners you will find a WARN message in the logs like this:
Found OSGi Event Handler for deprecated resource bridge: com.acme.myservice
This will help you to identify all these listeners.
How do I rewrite them to ResourceChangeListeners?
In the majority of cases this should be straightforward: make your service implement the ResourceChangeListener interface and provide the additional OSGi properties:
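A minimal sketch of such a listener (class name and paths are made up; the interface and its property constants are from org.apache.sling.api.resource.observation):

import java.util.List;

import org.apache.sling.api.resource.observation.ResourceChange;
import org.apache.sling.api.resource.observation.ResourceChangeListener;
import org.osgi.service.component.annotations.Component;

@Component(service = ResourceChangeListener.class, property = {
        ResourceChangeListener.PATHS + "=/content/mysite",   // only listen below this path
        ResourceChangeListener.CHANGES + "=ADDED",           // only these types of changes
        ResourceChangeListener.CHANGES + "=CHANGED"
})
public class MySiteChangeListener implements ResourceChangeListener {

    @Override
    public void onChange(List<ResourceChange> changes) {
        // keep this method fast; offload expensive work to a Sling job if needed
        for (ResourceChange change : changes) {
            // e.g. react based on change.getType() and change.getPath()
        }
    }
}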
Error handling is a topic developers rarely pay much attention to. It is done when the API forces them to handle an exception. And the most common pattern I see is the "log and throw" pattern, which means that the exception is logged and then re-thrown.
When you develop in the context of HTTP requests, error handling can get tricky, because you need to signal to the consumer of the response that an error happened and the request was not successful. Frameworks are designed to handle any exception internally and set the correct error code if necessary. Sling is no different: if your code throws an exception (for example in the postConstruct of a Sling Model), the Sling framework catches it and sets the correct status code 500 (Internal Server Error).
I've seen code which catches exceptions itself and sets the status code on the response on its own. But this is not the right approach, because with every exception handled this way the developer implicitly states: "These are my exceptions and I know best how to handle them"; almost as if the developer takes ownership of these exceptions and their root causes, and as if there's nothing which could handle this situation better.
Handling exceptions entirely on your own like this is not best practice, and I see two problems with it:
Setting the status code alone is not enough; the remaining parts of the request processing need to be stopped as well. Otherwise the processing continues as if nothing had happened, which is normally not useful or even allowed. It's hard to ensure this when the exception is caught.
Owning the exception handling removes the responsibility from others. In AEM as a Cloud Service Adobe monitors response codes and the exceptions causing them. If there's only a status code 500 but no exception reaching the SlingMainServlet, it's likely to be ignored, because the developer claimed ownership of the exception (handling).
If you write a Sling servlet or code operating in the context of a request, it is best practice not to catch exceptions, but to let them bubble up to the Sling Main Servlet, which is able to handle them appropriately. Handle exceptions yourself only if you have a better way to deal with them than just logging them.
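To illustrate the difference, a small sketch (servlet registration and business logic are made up); the commented-out variant swallows the problem, the active one lets Sling handle it:

import java.io.IOException;

import javax.servlet.ServletException;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.PersistenceException;
import org.apache.sling.api.servlets.SlingAllMethodsServlet;

public class MyUpdateServlet extends SlingAllMethodsServlet {

    @Override
    protected void doPost(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws ServletException, IOException {
        try {
            updateRepository(request); // hypothetical business logic
        } catch (PersistenceException e) {
            // Anti-pattern: claim ownership and let processing continue
            //   log.error("update failed", e);
            //   response.setStatus(500);

            // Better: let it bubble up; Sling sets the 500 status and the
            // exception stays visible to monitoring
            throw new ServletException("update failed", e);
        }
    }

    private void updateRepository(SlingHttpServletRequest request) throws PersistenceException {
        // ... modify resources and commit
    }
}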
Every now and then I see the question: "We get the TooManyCallsException while rendering pages, and we need to increase the threshold for the number of inclusions to 5000. Is this a problem? What can we do so we don't run into this issue at all?"
Before I answer this question, I want to explain the background of this setting, why it was introduced and when such a “Call” is made.
Sling rendering is based on servlets; and while a single servlet can handle the rendering of the complete response body, that is not that common in AEM. AEM pages normally consist of a variety of different components, which internally can consist of distinct subcomponents as well. This depends on the design approach the development team has chosen. (It should be mentioned that all JSPs and all HTL scripts are compiled into regular Java servlets.)
That means that the rendering process can be considered a tree of servlets, with servlets calling other servlets (and the DefaultGetServlet being the root of such a tree when rendering pages). This tree is structured along the resource tree of the page, but it can include servlets which render content from different areas of the repository, for example when dealing with content fragments or including images, which require their metadata to be respected.
It is possible to turn this tree into a cyclic graph, and that means that the process of traversing this tree of servlets turns into a recursion. In that case request processing will never terminate, the Jetty thread pool will quickly fill up to its limit, and the system will become unavailable. To avoid this situation only a limited number of servlet calls per request is allowed. And that's this magic number of 1000 allowed calls (which is configured in the Sling Main Servlet).
Knowing this, let me try to answer the question "Is it safe to increase this value of 1000 to 5000?". Yes, it is mostly safe; in case your page rendering goes recursive it just terminates later, which slightly increases the risk of your AEM instance becoming unavailable.
"Are there any drawbacks? Why is the default 1000 and not 5000 (or 10000 or any higher value)?" From experience 1000 is sufficient for the majority of applications. It might be too low for applications whose components are designed in a very granular way, which in turn requires a lot of servlet calls to properly render a page. And every servlet call comes with a small overhead (mostly for running the component-level filters); even if this overhead is just 100 microseconds, 1000 invocations add up to 100 ms just for the invocation overhead. That means you should find a good balance between a clean application modularization and its runtime performance overhead.
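If you do decide to change the limit, it is an OSGi configuration of the Sling Main Servlet. A sketch of such a configuration file (PID and property name are taken from the Sling engine metatype; verify them against your version, and check whether your environment permits this change):

org.apache.sling.engine.impl.SlingMainServlet.cfg.json:
{
  "sling.max.calls": 5000
}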
Which leads to the next question: "What are the problematic calls we should think of?". Good one. From a high-level view of AEM page rendering you cannot avoid the servlet calls which render the components. That means that you as an AEM application developer cannot influence the overall page rendering process; you can only try to optimise the rendering of individual (custom) components. To do that, you should be aware that the following things trigger the invocation of a servlet during page rendering:
the <cq:include>, <sling:include> and <sling:forward> JSP tags
the data-sly-include statement of HTL
and every method which directly or indirectly invokes the service() method of a servlet.
A good way to check this for some pages is the "Recent requests" functionality of the OSGi webconsole.
For many large websites, CDNs are the foundation for delivering content quickly to customers around the world. The ability of CDNs to cache responses close to consumers also allows these sites to operate on a small hardware footprint, compared to what they would have to invest if they operated without a CDN and delivered all content through their own systems. However, this comes at a cost: your CDN may now deliver content that is out of sync with your origin because you changed the content on your own system, and this change does not propagate in an atomic fashion ("atomic" as in the ACID principle of database implementations). This is a conscious decision, and it is caused primarily by the CAP theorem. It states that in a distributed data storage system, you can only achieve 2 of these 3 guarantees:
Consistency
Availability
Partition tolerance
And in the case of a CDN (which is a highly distributed data storage system), its developers usually opt for availability and partition tolerance over consistency. That is, they accept delivering content that is out of date because the originating system has already updated it.
The HTTP protocol has features built in which help to mitigate this problem at least partially. Check out the latest RFC draft on HTTP caching, it is a really good read. The main feature is the "TTL" (time-to-live), which means that the CDN delivers a cached version of the content only for a configured time; afterwards the CDN fetches a new version from the origin system. The technical term for this is "eventually consistent", because at that point the state of the system with respect to that content is consistent again.
This is the approach all CDNs support, and it works very reliably, but only if you accept that content changed on the origin system reaches your consumers with this delay. The delay is usually set to a period of time that is empirically determined by the website operators, trying to balance the need to deliver fresh content (which requires a very low or no TTL) against the number of requests that the CDN can answer instead of the origin system (for which the TTL should be as high as possible). Usually it is in the range of a few minutes.
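In HTTP such a TTL is typically expressed via the Cache-Control response header; for example, a 5-minute TTL for shared caches (like a CDN) combined with a shorter one for browsers could look like this:

Cache-Control: max-age=60, s-maxage=300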
(Even if you don't use a CDN in front of your origin systems, you need these caching instructions, because otherwise browsers will make assumptions and cache the requested files on their own. Browsing the web without caching is slow, even on very fast connections, not to mention what happens when using a mobile device on a slow 3G line. Eventual consistency is an issue you can't avoid when working on the web.)
Caching is an issue you will always have to deal with when creating web presences. Try to cache as much as possible, without neglecting the need to refresh or update content at arbitrary times.
You need to constantly address eventual consistency. Atomic changes (meaning changes that are immediately visible to all consumers) are possible, but they come at a price: you can't use CDNs for this content, but must deliver it all directly from your origin system. In this case you need to design your origin system so that it can work without eventual consistency at all (and eventual consistency is built into many systems), not to mention the additional load it will have to handle.
And for this reason I would always recommend not relying on atomic updates or consistency across your web presence. Always factor eventual consistency into the delivery of your content. In most cases even business requirements demanding "immediate updates" can be satisfied with a TTL of 1 minute. Still not "immediate", but good enough in 99% of all cases. For the remaining 1% where consistency is mandatory (e.g. real-time stock trading) you need to find a different solution, and I am not sure if the web is always the right technology then.
And as an afterthought regarding TTL: of course many CDNs offer you the chance to actively invalidate content, but this also comes at a price. In many cases you can only invalidate single files. Often it is not an immediate action, but takes seconds up to many minutes. And the price is always that you must have the capacity to handle the load when the CDN needs to refetch a larger chunk of content from your origin system.
During some recent work on performance improvements in request processing I used a tool which has been part of AEM for a very long time now; I cannot recall a time when it was NOT there. It's very simple, but nevertheless powerful, and it can help you understand the processing of requests in AEM much better.
I am talking about the "Recent Requests Console" in the OSGi webconsole, which is a gem in the "AEM performance tuning" toolbox.
In this blog post I use this tool to explain the details of the request rendering process of AEM. You can find the detailed description of this process in the pages linked from this page (Sling documentation).
Screenshot “Recent requests”
On this Recent Requests screen (go to /system/console/requests) you can drill down into the rendering process of the last 20 requests handled by this AEM instance; they are listed at the top of the screen. Be aware that if you have a lot of concurrent requests you might often miss the request you are looking for; if you really rely on it, you should increase the number of requests which are retained. This can be done via the OSGi configuration of the Sling Main Servlet.
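A sketch of such a configuration (property name as found in the Sling Main Servlet metatype; verify it against your version):

org.apache.sling.engine.impl.SlingMainServlet.cfg.json:
{
  "sling.max.record.requests": 100
}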
When you have opened a request, you will see a huge number of individual log entries. Each log entry starts with a timestamp (in microseconds, 1000 microseconds = 1 millisecond) relative to the start of the request. With this information you can easily calculate how much time passed between two entries.
Each request has a typical structure, so let's go through it using the AEM Start page (/aem/start.html). Just request that page in a different browser window, then check back on the "Recent Requests" console and select the "start.html" entry. In the following I will go through the lines, starting from the top.
Here we see the 2 log lines for the resolution of the resource type. It took 1932 microseconds to map the request "/aem/start.html" to the resourcetype "granite/core/components/login", with the path being /libs/granite/ui/content/shell/start. Additionally we see information about the selector, extension and suffix elements.
4923 TIMER_START{ServletResolution}
4925 TIMER_START{resolveServlet(/libs/granite/ui/content/shell/start)}
4941 TIMER_END{14,resolveServlet(/libs/granite/ui/content/shell/start)} Using servlet BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)
4945 TIMER_END{21,ServletResolution} URI=/aem/start.html handled by Servlet=BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)
That's a nested servlet resolution, which takes 14 and 21 microseconds respectively. Up to here everything is mostly standard and hard to influence performance-wise. But it already gives you a lot of information, especially regarding the resourcetype which is managing the complete response processing.
These are all request-level filters, which are executed just once per request.
And now the interesting part starts: the rendering of the page itself. The building blocks are called "components" (that term is probably familiar to you) and their rendering always follows the same pattern:
Calling Component Filters
Executing the Component
Return from the Component Filters (in reverse order of the calling)
This pattern can be clearly seen in the output, but most often it is more complicated because many components include other components, and so you end up in a tree of components being rendered.
As an example for the straightforward case we can take the "head" component of the page:
At the top you see the LOG statement "Including resource …", which tells you which resource is rendered, including additional information like selector, extension and suffix.
The next statement is the resolution of the render script which is used to render this resource, plus the time it took (40 microseconds).
Then we have the invocation of all component filters, the execution of the render script itself, which uses a TIMER to record start time, end time and duration (18576 microseconds), and the unwinding of the component filters.
If you use a recent version of the SDK for AEM as a Cloud Service, all timestamps are in microseconds, but in AEM 6.5 and older the durations measured for the filters (inner=…, outer=…) were printed in milliseconds (an inconsistency I fixed just recently).
If a component includes another component, it looks like this:
You see the component filters, but then, after the TIMER_START line for the page.jsp (check the trailing timer number #0; every timer has a unique ID!), you see the inclusion of a new resource. For this resource the render script is again resolved, and instead of the ComponentFilters the IncludeFilters are called, though in the majority of cases the lists of filters are identical. Depending on the resource structure and the scripts, the rendering tree can get really deep. But eventually you can see that the rendering of the page.jsp is completed; you can easily find it by looking for the respective timer ID.
Equipped with this knowledge you can now easily dig into the page rendering process and see which resources and resource types are part of the rendering of a page. And if you are interested in the bottlenecks of the page rendering process, you can check the TIMER_END lines, which include both the render script and the time in microseconds it took to render it (be aware that this time also includes the time to render all scripts invoked from this render script).
But the really cool part is that this is extensible. Via the RequestProgressTracker you can easily write your own LOG statements, start timers etc. So if you want to debug requests to better understand the timing, you can easily use something like this:
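A minimal sketch (timer and message names are made up; the API is org.apache.sling.api.request.RequestProgressTracker, obtained from the SlingHttpServletRequest):

RequestProgressTracker tracker = slingRequest.getRequestProgressTracker();
tracker.startTimer("myExpensiveLookup");                    // shows up as TIMER_START{myExpensiveLookup}
Object result = doExpensiveLookup();                        // hypothetical business logic
tracker.logTimer("myExpensiveLookup");                      // shows up as TIMER_END{<duration>,myExpensiveLookup}
tracker.log("MyComponent: lookup returned {0} items", 42);  // a plain LOG entry, MessageFormat style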
You can then find these log messages in this screen whenever the component is rendered. You can use them to output useful (debugging) information or just use their timestamps to identify performance problems. This can be superior to normal logging (to a logfile), because you can leave these statements in production code without polluting the log files. You just need access to the OSGi webconsole: search for the request you are interested in and check the rendering process.
And if you are interested, you can also get all entries recorded in this screen programmatically and do whatever you like with them. For example you can write a (request-level) filter which first calls the next filter in the chain, and afterwards logs all entries of the RequestProgressTracker to the logfile if the request processing took more than 1 second; see the sketch below.
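A sketch of such a filter (class name and threshold are made up; EngineConstants and RequestProgressTracker are existing Sling APIs):

import java.io.IOException;
import java.util.Iterator;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.request.RequestProgressTracker;
import org.apache.sling.engine.EngineConstants;
import org.osgi.service.component.annotations.Component;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Component(service = Filter.class,
        property = EngineConstants.SLING_FILTER_SCOPE + "=" + EngineConstants.FILTER_SCOPE_REQUEST)
public class SlowRequestLoggingFilter implements Filter {

    private static final Logger LOG = LoggerFactory.getLogger(SlowRequestLoggingFilter.class);
    private static final long THRESHOLD_MS = 1000;

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long start = System.currentTimeMillis();
        try {
            chain.doFilter(request, response); // run the actual request first
        } finally {
            long duration = System.currentTimeMillis() - start;
            if (duration > THRESHOLD_MS && request instanceof SlingHttpServletRequest) {
                RequestProgressTracker tracker =
                        ((SlingHttpServletRequest) request).getRequestProgressTracker();
                LOG.warn("Request took {} ms, dumping progress tracker:", duration);
                Iterator<String> messages = tracker.getMessages();
                while (messages.hasNext()) {
                    LOG.warn(messages.next());
                }
            }
        }
    }

    @Override
    public void init(FilterConfig filterConfig) { /* nothing to do */ }

    @Override
    public void destroy() { /* nothing to do */ }
}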
The RequestProgressTracker plus the "Recent Requests" screen of the OSGi webconsole are a really cool combination, both to help you understand the inner workings of Sling request processing and to analyze and understand the performance of request processing.
I hope that this technical deep dive into the Sling page rendering process was helpful, and that you are able to spot many interesting aspects of an AEM system just by using this tool. If you have questions, please leave me a comment below.
This time I want to discuss a different type of micro-optimization. It’s not something you as a developer can implement in your code, but it’s rather a question of the application design, which has some surprising impact. I came across it when I recently investigated poor performance in the Siteadmin navigation. And although I did this investigation in AEM as a Cloud Service, the logic on AEM 6.5 behaves the same way.
When you click through your pages in the siteadmin navigation, AEM collects a lot of information about pages and folders to display them in the proper context. For example, when you click on a page with child pages, it collects information about which actions should be displayed if a specific child node gets selected (copy, paste, publish, …).
One important piece of information is whether the "Create page" action should be made available. And that's what I want to outline in this article.
Screenshot: “Create” dialog
Assuming that you have the required write permissions on that folder, the most important question is whether any template is allowed to be used for a new child page of the current page. The logic is described in the documentation and is quite complex.
In short:
On the content side the template must be allowed (via the cq:allowedTemplates property, if present) AND
The template must be allowed to be used as a child page of the current page
Both conditions must be met for a template to be eligible as the source for a new page. To display the entry "Page" it's sufficient if at least one template is allowed.
Now let's think about the runtime performance of this check, which is mostly determined by the total number of templates in the system. AEM determines all templates with this JCR query:
//jcr:content/element(*,cq:Template)
And that query returns 92 results on my local SDK instance with WKND installed. If we look a bit more closely at the results, we can distinguish 3 different types of templates:
Static templates
Editable templates
Content Fragment models
So depending on your use case it's easy to end up with hundreds of templates, and not all of them are applicable at the location you are currently in. In fact, typically just a very few templates can be used to create a page there. That means that the check most likely needs to iterate over a lot of templates before it eventually encounters a match.
Let's come back to the evaluation whether that entry should be displayed. If you have defined the cq:allowedTemplates property on the page or one of its ancestors, it's sufficient to check the templates listed there. Typically that's just a handful of templates, and it's very likely that you find a "hit" early on, which immediately terminates the check with a positive result. I want to explicitly mention that not every template listed there can actually be used, because there are also other constraints (e.g. the template of the parent page must be of a certain type etc.) which must match.
If template A is allowed to be used below /content/wknd/en, then we just need to check this single template A to get the hit. We don't care where it is in the list of templates returned by the above query, because we know exactly which one(s) to look at.
If that property is not present, AEM needs to go through all templates and check the conditions for each and every one until it finds a positive result. The order of this list is identical to the order in which the templates are returned by the JCR query, which means the order is not deterministic. It is also not possible to order the result in a helpful way, because the semantics of our check (which includes regular expressions) cannot be expressed as part of the JCR query.
So you are very lucky if the JCR query returns a matching template already at position 1 of the list, but that's very unlikely. Typically you need to iterate over tens of templates to get a hit.
So, what's the impact of this iteration and these checks on performance? In a synthetic test with 200 templates and no match at all, it took around 3-5 ms to iterate over and check all of the results.
You might say "I really don't feel a 3-5 ms delay", but when the list view in siteadmin performs this check for up to 40 pages in a single request, it adds up to a 120-200 millisecond difference. And that is a significant delay for requests where bad performance is immediately visible, especially since there's a simple way to mitigate it.
And for that reason I recommend providing cq:allowedTemplates properties in your content structure. In many cases it's possible, and it will speed up the siteadmin navigation.
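For illustration, such a property could look like this on the site root (the paths follow the WKND example and are purely illustrative; the values are regular expressions matched against template paths):

/content/wknd/en/jcr:content
    cq:allowedTemplates = [/conf/wknd/settings/wcm/templates/article-page-template,
                           /conf/wknd/settings/wcm/templates/content-page-template]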
And for those who cannot change that: I am currently working on changing the logic to speed up the processing for the cases where no cq:allowedTemplates property is applicable. If you are on AEM as a Cloud Service, you'll get this improvement automatically.
Welcome to my third post on AEM micro-optimizations, again with some interesting ways to improve your AEM application performance, sometimes with small improvements, sometimes with significant ones.
During some recent performance optimization I came across code which felt a bit odd. Technically it was quite simple:
for (Item item : manyItems) {
    processSingleItem(resolver, item);
}

void processSingleItem(ResourceResolver resolver, Item item) throws PersistenceException {
    // do something with the resourceResolver
    resolver.commit();
}
That is indeed a very common pattern, especially in software which evolved over time: you have code which deals with a single item, and later, when you need to do the same for multiple items, you execute this code in a loop. Works perfectly, and the pattern is widely used.
And it can be problematic.
But what if there is an operation in that processSingleItem() method which comes with a static overhead? Maybe you are not aware of that overhead, so it goes unnoticed. Maybe you expect that if processSingleItem() takes 5 ms for one item, requiring 50 ms for 10 items is OK. Well, an O(n) algorithm isn't too bad, is it?
But what if I tell you that the static overhead of that method is so large that processing 10 items instead of just one increases its runtime not by a factor of 10, but only by a factor of 1.1?
Imagine you need to go grocery shopping for your Sunday dinner. You get yourself ready, take the bike to the grocery store, get the potatoes you need. Pay, and get back home. Drop the potatoes there. Then again, taking the bike to the grocery store, getting some meat. Back home. Again to the grocery store, this time for paprika (grilled paprika is delicious …). And so on and so on, until you have everything you need for your barbecue on Sunday. You have now spent 6 hours, mostly on the bike and waiting at the counter.
Are you doing that? No, of course not. You ride once to the grocery store, get all the things, pack them onto your bike, and get home. Takes maybe 90 minutes. Having the static overhead (cycling, waiting at the counter) just once saves a lot of time.
It's the same in coding. You have static overhead (acquiring locks, getting database connections, network latency, calling through thick framework layers which just copy references to the data) which is not determined by the amount of data you process. But unlike in the grocery shopping example it's not directly visible where such static overhead occurs, and unfortunately documentation rarely points it out.
Writing to the repository comes with such a static overhead, and it can be like a 20-minute ride to the grocery store. Saving 10 times in small batches definitely takes more time than saving once with a batch of 10 times the size, at least if you keep the size of the changeset limited; for details check this earlier posting of mine.
Check out this great presentation by Georg Henzler at adaptTo() 2019 (starting at 17:00 min) (slides) for some benchmark data on how the size of the changeset influences the time to save (spoiler: for realistic sizes it does not really increase).
So I changed the above code to something like this:
for (Item item : manyItems) {
    processSingleItem(resolver, item);
}
resolver.commit();

void processSingleItem(ResourceResolver resolver, Item item) {
    // do something with the resourceResolver, but no commit
}
Switching to this approach improved the performance for ~100 items by a factor of more than 10! That's an impressive number for such a minimal change.
So check your code for this specific pattern, find out if the prerequisites are met (that means small individual changes), and add some performance logging. Then convert to this batching mode and see what your numbers are doing.
Of course, very often this saving happens in the context of a much larger operation, and a 10 times improvement in this area might only speed up the larger operation from 12 seconds to 11 seconds. But hey, when you get this 1 second almost for free, just take it (we are still talking about micro-optimizations). And nothing prevents you from taking a deeper look into what the system is doing in the remaining 11 seconds.
Leave me a comment if you have some interesting story to share, where such small changes resulted in big improvements.
Micro-optimizations are important, and their importance is well described by an LWN posting about the Linux kernel:
Most users are unlikely to notice any amazing speed improvements resulting from these changes. But they are an important part of the ongoing effort to optimize the kernel’s behavior wherever possible; a long list of changes like this is the reason why Linux performs as well as it does.
And this is not specific to the Linux kernel; you can apply the same strategy to every piece of software. The very same holds for AEM, a complex (and admittedly sometimes really slow) beast.
There are a number of cases in AEM where you operate not just on single objects (pages, assets, resources, nodes), but apply the same operation to many of these objects.
The naive approach of just iterating over the list and executing the operation on each single element can be quite inefficient, especially if this operation comes with a static overhead.
Some examples:
For replication there are some pre-checks, then the creation of the package, the creation of the Sling jobs (or sending the package to the pipeline when running on AEM as a Cloud Service), the update of the replication status, and writing the audit log entries.
When determining the replication status of a page, the replication queues need to be checked whether this page is still subject to a pending replication, which can get slow when the queues are full.
Committing changes to the JCR repository; there is a certain overhead in it (validating all changes, committing them to permanent storage, invoking the synchronous listeners, locking etc.).
And in many cases these bottlenecks have been known for a while, and there is an API which allows you to perform the action in batch mode for a multitude of elements:
Replication: Batch replication (you can provide a number of path strings)
The ReplicationStatusProvider was introduced some years back when we had to deal with large workflow packages being replicated, which resulted in a lot of traversals of the replication queue entries. Adding this optimized version improved the performance by at least a factor of 10; even in less intense operations I expect an improvement.
So if you have a hand-crafted loop to execute a certain activity on many elements, check if a more efficient batch API is available. There’s a good chance that it is already there.
If you have more cases where a batch mode should be available but isn't, leave a comment here. I am happy to help you either find the right API or potentially kickstart a product improvement.
As a followup on the previous article I want to show you what a micro-optimization can look like. My colleague Miroslav Smiljanic found that there is a significant difference in the time it takes to execute statements (1) and (2).
Is this change important? By itself it is not, because going up the resource/node tree is not that common compared to going down the tree. So replacing a single call might yield only an improvement of a fraction of a millisecond, even though case (2) is up to 200 times faster than case (1)!
But if we can replace the code in all cases where the slower variant can be substituted by the performant getParent() call, especially in the low-level areas of AEM and Sling, all areas might benefit from it. And then we don't execute it just once per page rendering, but maybe a hundred times, and we might already end up with tens of milliseconds of improvement, for every request!
And in special use cases the effect can be even bigger (for example if your code constantly traverses the tree upwards).
Another example of such a micro-optimization, which is normally quite insignificant but can yield huge benefits in special cases, can be found in SLING-10269, where I found that built-in caching of the isResourceType() result reduces the rendering times of some special requests by 50%, because it is invoked thousands of times.
Typically micro-optimizations have these properties:
In the general case the improvement is barely visible (< 1% improvement of performance)
In edge cases they can be a life saver, because they reduce execution time by a much larger percentage.
The interesting part is that these improvements accumulate over time. When you have implemented 10 of these in low-level routines, chances are high that your use case benefits from them as well. Maybe by 10 times 0.5% performance improvement, but maybe also by 20%, because you hit the sweet spot of one of them.
So it is definitely worth paying attention to these improvements.
My recommendation for you: read the entry in the Oak "Do's and Don'ts" page and try to apply this learning to your codebase. And if you find more such cases in the Sling codebase, the community appreciates a ticket.