If you work in IT (or are a craftsperson), you probably know the saying: “When you have a hammer, every problem looks like a nail”. It describes the tendency of people who have a tool that reliably solves one specific problem to use that tool on every other problem as well, even if it does not fit at all.
Sometimes I see this pattern in AEM as well, not with a hammer, but with “curl”. Curl is a command-line HTTP client, and it makes it quite easy to fire a request against AEM and do something with the output. It’s something every AEM developer should be familiar with, also because it’s a great tool to automate things. And if you talk about “automating AEM”, the first thing people often come up with is “curl”…
And that’s where the problem starts: not every problem can be automated with curl. Take, for example, a periodic data export from AEM. The immediate reaction of most developers (forgive me if I generalize here, but I have seen this pattern too often!) is to write a servlet that pulls all this data together and creates a CSV list, and then to use curl to request this servlet every day or week.
Works great, doesn’t it? Good, mark that task as done, next!
Wait a second, on prod it takes 2 minutes to create that list. Well, not a problem, right? Until it takes 20 minutes, because the number of assets is growing. And until you move to AEM CS, where the timeout for requests is 60 seconds, and your curl is terminated with status code 503.
So what is the problem? It is not the timeout of 60 seconds, and it’s also not the constantly increasing number of assets. It’s the fact that this is a batch operation, and you use a communication pattern (request/response) which is not well suited for batch operations. It’s the fact that you start with curl in mind (a tool which is built for the request/response pattern) and therefore you build the implementation around this pattern. You have curl, so every problem is solved with a request.
What are the limits of this request/response pattern? Definitely the runtime is a limit, and actually for 3 reasons:
- The timeout for requests on AEM CS (or basically any other load balancer) is set for security reasons and to prevent misuse. Of course the limit of 60 seconds in AEM CS is a bit arbitrary, but personally I would not wait 60 seconds for a webpage to start rendering. So it’s as good as any higher number.
- There is another limit, determined by the availability of the backend system that actually processes the request. In a highly available, autoscaling environment, systems start and stop automatically, managed by a control plane that operates on a set of rules. These rules can enforce that any (AEM) system is shut down at most 10 minutes after it has stopped receiving new requests. For a request that consistently takes 30+ minutes, this means it might be terminated before it finishes successfully. And it’s unclear whether your curl would even notice (especially if you are streaming the results).
- (And technically you can also add that the whole network connection needs to be kept open for that long, and AEM CS itself is just a single factor in there. Also, the internet is not always stable; you can experience network hiccups at any point in time. This is normally well hidden by retrying failed requests, which is not an option here, because a retry would just start the same long-running request again.)
In short: If your task can take long (say: 60+ seconds), then a request is not necessarily the best option to implement it.
So, what options do you have then? Well, the following approach also works in AEM CS:
- Use a request to create and initiate your task (let’s call it a “job”);
- And then poll the system until this job is completed, then return the result.
This is an asynchronous pattern, and it’s much more scalable when it comes to the amount of processing you can do in there.
Of course you cannot use a single curl command anymore; you now need to write a program to execute this logic (don’t write it in a shell script, please!). On the AEM side you can then use either Sling jobs or AEM workflows to perform the operation.
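To make the client side of this pattern a bit more concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the host, the two `/bin/export/jobs` URLs, and the JSON fields `jobId` and `state` are hypothetical and would have to be provided by your own servlet; they are not an AEM API. Authentication is left out as well.

```python
import json
import time
import urllib.request

# Hypothetical endpoints: a custom servlet would have to provide them.
START_URL = "https://author.example.com/bin/export/jobs"
STATUS_URL = "https://author.example.com/bin/export/jobs/{job_id}"


def http_json(url, method="GET"):
    """Send one short request and decode the JSON response body."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def start_job():
    """Ask the server to create the job; the response only carries its id."""
    return http_json(START_URL, method="POST")["jobId"]


def fetch_status(job_id):
    """Fetch the current state of the job (assumed field: 'state')."""
    return http_json(STATUS_URL.format(job_id=job_id))


def wait_for_job(job_id, fetch=fetch_status, interval=10.0, max_wait=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reports SUCCEEDED or FAILED, or max_wait elapses."""
    deadline = clock() + max_wait
    while True:
        try:
            status = fetch(job_id)
        except OSError:
            # Network hiccup or failed request (urllib errors derive from
            # OSError): every poll is cheap, so simply try again later.
            status = {"state": "UNKNOWN"}
        if status.get("state") in ("SUCCEEDED", "FAILED"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {max_wait}s")
        sleep(interval)
```

A caller would run `wait_for_job(start_job())` and then inspect the returned status. Each individual request stays far below the 60-second limit, and a failed poll can be retried safely, because it only reads the job state and does not restart the work.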
The asynchronous approach avoids the 60-second restriction, and it can handle restarts of AEM transparently, at least on the author side. And you have the huge benefit that you can collect all errors during the runtime of the job and decide afterwards whether the execution was a success or a failure (which you cannot do in a single HTTP response once the status code has been sent).
So when you have long-running operations, check whether you really need to do them within a request. In many cases it’s not required, and then please switch gears to an asynchronous pattern. And that’s something you can do even before the situation becomes a problem.
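The idea of collecting errors first and deciding on the outcome afterwards can be sketched like this. It is a language-neutral illustration in Python, not Sling job or workflow code; `JobReport`, `run_export` and `max_error_ratio` are names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class JobReport:
    processed: int = 0
    errors: list = field(default_factory=list)
    state: str = "RUNNING"


def run_export(paths, export_one, max_error_ratio=0.0):
    """Process every path, record failures, and judge the run at the end."""
    report = JobReport()
    for path in paths:
        report.processed += 1
        try:
            export_one(path)
        except Exception as exc:
            # Keep going: one broken item should not abort the whole batch.
            report.errors.append(f"{path}: {exc}")
    # The verdict is made only after all items have been seen.
    allowed = report.processed * max_error_ratio
    report.state = "SUCCEEDED" if len(report.errors) <= allowed else "FAILED"
    return report
```

A polling client would then read the final `state` and the collected `errors` from the job status, instead of guessing from a response whose status code was fixed before the work was done.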