How to analyze “Authentication support missing”

Errors and problems in running software often manifest in interesting and non-obvious ways. A problem in location A may show up only as a seemingly unrelated error message in a different location B.

We have an example of such a situation in AEM, too: the famous “Authentication support missing” error message. I often see the question “I got this error message; what should I do now?”, so I decided it’s time to write a blog post about it. Here you are.

Taken literally, “Authentication support missing” is not even the real problem: it only says that there is no authentication module available, so you cannot authenticate. In 99.99% of cases this is just a symptom, because the default AEM authentication depends on a running SlingRepository service, and a running SlingRepository has a number of dependencies of its own.

I want to highlight 2 of these dependencies, because they tend to cause problems most often: the Oak repository and the RepositoryInitializer service. Both must have started and run successfully before the SlingRepository service can be registered. Let’s look into each of these dependencies.

The Oak repository

The Oak repository is quite a complex system in itself, and there are many reasons why it might not start. To name a few:

  • Consistency problems with the repository files on disk (for whatever reasons), permission problems on the filesystem, full disks, …
  • Connectivity issues towards the storage (especially if you use a database or MongoDB as storage)
  • Messed up configuration

If you get an “Authentication support missing” message, your first check should be the Oak repository, whose messages typically end up in the AEM error.log. If any “org.apache.jackrabbit.oak” class logged an ERROR message during startup, this is most likely the culprit. Investigate from there.
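As a minimal sketch of that first check, the following Python function scans log lines for ERROR entries whose logger name starts with “org.apache.jackrabbit.oak”. The line layout it expects (level in asterisks, thread name in brackets, then the logger class) is an assumption based on the sample log lines shown in this post, and the function name is made up for illustration.

```python
import re

# Assumed line layout, based on the sample log lines in this post:
#   <timestamp> *LEVEL* [thread name] logger.class.Name message
ERROR_LINE = re.compile(
    r"\*ERROR\* \[(?P<thread>[^\]]*)\] (?P<logger>\S+) (?P<message>.*)"
)

OAK_PREFIX = "org.apache.jackrabbit.oak"


def find_oak_errors(lines):
    """Return (line_number, logger, message) for each ERROR logged by an Oak class."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = ERROR_LINE.search(line)
        if match and match.group("logger").startswith(OAK_PREFIX):
            hits.append(
                (number, match.group("logger"), match.group("message").strip())
            )
    return hits
```

You could feed it the lines of your error.log (for example via open(...) on wherever your instance writes it) and start reading at the first hit, since later errors are often just follow-up failures.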

Sling Repository Initializer (a.k.a. “repoinit”)

Repoinit is designed to ensure that a certain structure exists in the repository even before any consumer accesses it. All of the available scripts must be executed, and any failure will immediately terminate the startup of the SlingRepository service. See also my latest blog post on Sling Repository Initializer for details on how to prevent such problems.

Repoinit failures are typically quite prominent in the AEM error.log; just search for an ERROR message starting like this:

*ERROR* [Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted …
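To illustrate that search, here is a small Python sketch that looks for the marker text from the message above and returns whatever follows “registration aborted” on the same line. The function name is hypothetical, and the sketch assumes the exception text is printed on the same line as the marker; if your log puts it on the following line, the extracted cause will simply be empty.

```python
# Marker text taken from the error message shown above.
MARKER = "Exception in a SlingRepositoryInitializer"


def find_repoinit_failures(lines):
    """Return (line_number, cause) for every repoinit failure found in the log lines."""
    failures = []
    for number, line in enumerate(lines, start=1):
        if MARKER not in line:
            continue
        # Everything after "registration aborted" is treated as the cause,
        # assuming it is printed on the same line (may be an empty string).
        _, _, cause = line.partition("registration aborted")
        failures.append((number, cause.strip()))
    return failures
```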

These are the 2 biggest contributors to the “Authentication support missing” error message. Of course there are more reasons why it could appear, but to be honest, I have only seen these 2 cases in recent years.

I hope that this article helps you to investigate such situations more swiftly.

How to deal with RepoInit failures in Cloud Service

Some years ago, even before AEM as a Cloud Service, the repoinit language was implemented as part of Sling (and AEM) to create repository structures directly at the startup of the JCR repository. With it your application can rely on some well-defined structures always being available.

In this blog post I want to walk you through a way to test repoinit statements locally and avoid pipeline failures caused by them.

Repoinit statements are deployed as part of OSGi configurations, which means that during the development phase you can work with them in an almost interactive way. Exceptions are not a problem either; you can fix the statement and retry.

The situation is very different when you already have repoinit statements deployed and you start up your AEM (to be exact: the Sling Repository service) again, because in this case all repoinit statements are executed as part of the startup of the repository. Any exception during the execution of repoinit will stop the startup of the repository service and render your AEM unusable. In the case of CloudManager and AEM as a Cloud Service this will break your deployment.

Let me walk you through 2 examples of such an exception and how you can deal with it.

*ERROR* [Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted java.lang.RuntimeException: failed: javax.jcr.nodetype.ConstraintViolationException: OakConstraint0025: /conf/site/configuration/favicon.ico[[nt:file]]: Mandatory child node jcr:content not found in a new node 
at [] 

In this case the exception is quite detailed about what actually went wrong. It failed when saving, and it says that /conf/site/configuration/favicon.ico (of type nt:file) was affected. The problem is that a mandatory child node “jcr:content” is missing.

Why is it a problem? Because every node of nodetype “nt:file” requires a “jcr:content” child node which actually holds the binary.

This is a case you can easily detect in a local environment as well.

Which leads to the first recommendation:

When you develop in your local environment, you should apply all repoinit statements to a fresh environment in which there are no manual changes. Otherwise your repoinit statements may rely on the presence of things which are not provided by the repoinit scripts themselves.

Having a mix of manual changes and repoinit on a local development environment and then moving it over untested often leads to failures in the CloudManager pipelines.

The second example is a very prominent one, and I see it very often:

[Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted java.lang.RuntimeException: Failed to set ACL (java.lang.UnsupportedOperationException: This builder is read-only.) AclLine DENY {paths=[/libs/cq/core/content/tools], privileges=[jcr:read]} 

It’s the well-known “This builder is read-only” variant. To understand the problem and its resolution, I need to explain a bit how the build process assembles AEM images in the CloudManager pipeline.

In AEM as a Cloud Service you have an immutable part of the repository, which consists of the trees “/libs” and “/apps”. They are immutable because they cannot be modified at runtime, not even with admin permissions.

During build time this immutable part of the image is built. This process merges the product parts (/libs) and the custom application parts (/apps) together. After that, all repoinit scripts run as well, both the ones provided by the product and any custom ones. And of course during that part of the build these trees are writable, thus writing into /apps using repoinit is not a problem.

So why do you actually get this exception, when /libs and /apps are writable? Because repoinit is executed a second time: during the “final” startup, when /apps and /libs are immutable.

Repoinit is designed around the idea that all activities are idempotent. This means that if you want to create an ACL on /apps/myapp/foo/bar, the repoinit statement is a no-op if that specific ACL already exists. A second run of repoinit will change nothing, because it finds everything still in place.

But if the system actually executes this action again in the second run, it is not a no-op anymore. This means that the ACL (or whatever the goal of that repoinit statement was) is no longer there as expected.

And there is only one reason why this happens: some other action between these 2 executions of repoinit changed the repository. The only other thing which modifies the repository is the installation of content packages.

Let’s illustrate this problem with an example. Imagine you have this repoinit script:

create path /apps/myapp/foo/bar
set ACL on /apps/myapp/foo/bar
  allow jcr:read for mygroup
end

And you have a content package which contains content for /apps/myapp, with the filter set to “overwrite”, but which does not contain this ACL.

In this case the operations leading to this error are these:

  • Repoinit sets the ACL on /apps/myapp/foo/bar
  • The deployment overwrites /apps/myapp with the content package, so the ACL is wiped
  • AEM starts up
  • Repoinit wants to set the ACL on /apps/myapp/foo/bar, which is now immutable. It fails and breaks your deployment.
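The sequence above can be replayed with a toy model in Python. This is only a sketch to illustrate the idempotency argument; it does not use any AEM, Sling or Oak API, and every class, method and function name in it is invented for this example. The “repository” only tracks ACL entries per path, and the immutable flag stands in for the read-only /apps and /libs at the final startup.

```python
class ReadOnlyError(Exception):
    """Stands in for the 'This builder is read-only' failure."""


class ToyRepository:
    """Hypothetical, heavily simplified repository: only ACL entries per path."""

    def __init__(self):
        self.acls = {}          # path -> set of (privilege, principal)
        self.immutable = False  # True once /apps and /libs are frozen

    def repoinit_allow(self, path, privilege, principal):
        """Idempotent ACL statement; returns True only if a write was needed."""
        entry = (privilege, principal)
        if entry in self.acls.get(path, set()):
            return False  # already in place: a no-op, fine even when immutable
        if self.immutable:
            raise ReadOnlyError(f"cannot set ACL on {path}: read-only")
        self.acls.setdefault(path, set()).add(entry)
        return True

    def install_package(self, root):
        """Overwrite everything at or below root with content carrying no ACLs."""
        for path in list(self.acls):
            if path == root or path.startswith(root + "/"):
                del self.acls[path]


def deploy(package_wipes_acl):
    """Replay the steps: repoinit, package install, freeze, repoinit again."""
    repo = ToyRepository()
    repo.repoinit_allow("/apps/myapp/foo/bar", "jcr:read", "mygroup")
    if package_wipes_acl:
        repo.install_package("/apps/myapp")
    repo.immutable = True  # the "final" startup
    try:
        repo.repoinit_allow("/apps/myapp/foo/bar", "jcr:read", "mygroup")
        return "ok"
    except ReadOnlyError:
        return "failed"
```

In this model, deploy(False) succeeds because the second repoinit run finds the ACL in place and writes nothing, while deploy(True) fails because the package installation removed the ACL and the second run has to write into a read-only tree.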

The solution to this problem is simple: you need to adjust the repoinit statements and the package definitions (especially the filter definitions) in such a way that the package installation does not wipe or overwrite any structure created by repoinit. And by “structure” I do not refer only to nodes, but also to nodetypes, properties etc. All of them must be identical, and in the best case the packages and repoinit don’t interfere at all.

It is hard to validate this locally, as you don’t have an immutable /apps and /libs, but there is a test approach which comes very close to it:

  • Run all your repoinit statements in your local test environment
  • Install all your content packages
  • Enable write tracing (see my blog post)
  • Re-run all your repoinit statements
  • Disable write tracing again

During the second run of the repoinit statements you should not see any write in the trace log. If you have any write operation, it’s a sign that your packages overwrite structures created by repoinit. You should fix these asap, because they will later break your CloudManager pipeline.

With this information at hand you should be able to troubleshoot repoinit problems in your local test environment already, avoiding pipeline failures caused by them.