Hello Community,
Recently, we encountered some unexpected behavior in AEM, and the results of our investigation were even more surprising. Given that this issue is not specific to our project but seems to be related to AEM's core code, some of you might have come across it before. We would appreciate any insights or suggestions on how to properly address this issue. Please share your thoughts and experiences.
After deploying to production, requests to the home page resulted in a blank page. The HTML served appeared invalid: it had a valid <head> tag and its content, but the <body> tag was only opened without any content. Moreover, the <body> and <html> tags were left unclosed.
After a brief investigation, we discovered that requests with query parameters were processed correctly, indicating that the erroneous markup was only present at the dispatcher level.
From the AEM error logs, we identified that the root of the problem was AEM's inability to locate an HTML template necessary for the header component that was actually present in JCR. This led to the following error:
org.apache.sling.scripting.sightly.SightlyException: Cannot find resource ./templates/CUSTOM_TEMPLATE.html for base path /apps/PROJECT/components/header.
Further analysis of the dispatcher logs suggested that this request might have been sent by the skyline-service-warmup agent. This pointed to a potential issue with the Dispatcher Cache Warmup Service being triggered too early in the deployment process. Ideally, if AEM could not process the request properly at that time, it should have returned a 500 error, preventing the response from being cached.
So, how did a page with clearly erroneous markup end up being cached with invalid HTML?
After extensive debugging of AEM bundles, we found that the issue was related to the size of the HTML markup. In a nutshell, if an exception occurs when response buffer is overflowed, AEM will log the exception, but response status and content will remain unchanged.
When a request is received by an AEM instance, it is processed through the registered Sling filters, which write the results of parsed HTL templates into the response instance. The response modification is handled by the org.eclipse.jetty.server.HttpOutput's write method. Given its complexity, I’ll focus on the key parts related to the issue:
this.checkWritable();
long written = this._written + (long)len;
l = this.maximizeAggregateSpace();
last = this._channel.getResponse().isAllContentWritten(written);
aggregate = len <= this._commitSize && (!last || BufferUtil.hasContent(this._aggregate) && len <= l);
boolean flush = last || !aggregate || len >= l;
if (aggregate) {
this.acquireBuffer();
int filled = BufferUtil.fill(this._aggregate, b, off, len);
if (!flush) {
if (LOG.isDebugEnabled()) {
LOG.debug("write(array) {} aggregated !flush {}", this.stateString(), BufferUtil.toDetailString(this._aggregate));
}
break label160;
}
off += filled;
len -= filled;
}
boolean complete = false;
if (BufferUtil.hasContent(this._aggregate)) {
complete = last && len == 0;
this.channelWrite(this._aggregate, complete);
...
}
The process then transitions to the org.eclipse.jetty.server.HttpChannel.sendResponse(MetaData.Response response, ByteBuffer content, boolean complete, Callback callback) method, which is responsible for committing the request.
protected boolean sendResponse(MetaData.Response response, ByteBuffer content, boolean complete, Callback callback) {
boolean committing = this._state.commitResponse();
...
If an exception occurs during processing, the request handling is passed to the ErrorFilterChain, which is responsible for rewriting the response to adjust its content and status. However, this adjustment only happens if the response has not yet been committed.
if (response.isCommitted()) {
if (this.mode == ErrorFilterChain.Mode.STATUS) {
if (this.message == null) {
LOG.warn("handleError: Response already committed; cannot send error ".concat(String.valueOf(this.status)));
} else {
LOG.warn("handleError: Response already committed; cannot send error ".concat(String.valueOf(this.status)).concat(" : ").concat(this.message));
}
} else if (this.throwable.getMessage() != null) {
LOG.warn("handleError: Response already committed; cannot send error ".concat(this.throwable.getMessage()), this.throwable);
} else {
LOG.warn("handleError: Response already committed; cannot send error ".concat(this.throwable.getClass().getName()), this.throwable);
}
return;
}
response.reset();
The issue we encountered stemmed from having too much JavaScript code added into the <head> block, which led to the byte buffer overflowing. As a result, when the SightlyException was thrown, the response was already committed. This prevented the ErrorFilterChain from properly modifying the response status and content.
To replicate the issue, I will use the WKND project and its /apps/wknd/components/page component. This component inherits from the core/wcm/components/page/v3/page component and overlays the customheaderlibs.html and customfooterlibs.html templates. The customheaderlibs template is injected into the <head> block, while the customfooterlibs template is injected after the page content is included in the markup.
To demonstrate the difference in processing, I will add a reference to a non-existent resource in both of these files. This will help illustrate how the system handles missing resources differently according to the error location within the HTML.
The reference was added at the end of the file.
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<!--/* Google Fonts */-->
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Asar&family=Source+Sans+Pro:ital,wght@0,300;0,400;0,600;1,300;1,400;1,600&display=swap" rel="stylesheet">
<!--/* Include Context Hub */-->
<sly data-sly-resource="${'contexthub' @ resourceType='granite/contexthub/components/contexthub'}"/>
<!--/* Manifest */-->
<meta name="theme-color" content="#FFEA00">
<link rel="manifest" href="/etc.clientlibs/wknd/clientlibs/clientlib-site/resources/manifest.json">
<!--/* Favicons */-->
<sly data-sly-include="favicons.html"></sly>
<sly data-sly-use.template="template-123.html" data-sly-call="${template.template}"/>
When AEM processes the references, the buffer still has plenty of free space, and the response is not yet committed. As a result, when the page is requested, the out-of-the-box (OOTB) error page is displayed.
The customfooterlibs.html file is empty by default, so the invalid reference will be its sole content.
<sly data-sly-use.template="template-123.html" data-sly-call="${template.template}"/>
This template is injected after the actual page content has already been added to the markup. The page content causes the buffer to overflow, resulting in the response being committed. Consequently, when the page is requested, it initially appears normal because the response status is 200 and the content is present. However, upon inspecting the markup, you’ll notice that the closing </body> and </html> tags are missing. This issue arises because customfooterlibs should have been included just before these closing tags. Due to the exception, processing was handed off to the ErrorFilterChain, which did not handle the response correctly.
@RustamMussabay Great insights!
Hi @RustamMussabay
Thanks for sharing. If you want you can open an issue at https://github.com/adobe/htl-spec/issues to through error if non-existing template is found in HTL.
Hi @arunpatidar ,
Is HTL spec the correct place to address the issue? It is not only related to non-existing template and will occur if, for example, the exception is thrown inside a Sling Model.
Hi @RustamMussabay
This issue isn't related to Sling Models because the missing template occurs at the presentation layer and can be detected, similar to other types of missing references. I understand this won't resolve all problems related to buffer size, it could serve as an early detection if the issue is caused by HTL.
Hi @arunpatidar ,
I provided the missing template as an example of an error that might occur during page rendering. While the missing template results in a SightlyException, the issue would remain unchanged even if the template were present and contained a Sling Model that generated an exception. In such a case, AEM would still return a page with invalid HTML and a 200 status code.
That's a nice writeup, thank you!
I have debugged a few issues like this already, and it is indeed a common issue, especially if the exception is only happening sporadically. That means that you have a broken version in the dispatcher cache, but when a request bypasses the cache (e.g. by adding a query string to it) the page renders correctly.
This is a problem you can get with every type of application, where the responses are cached (and it does not have to be the dispatcher, it can also be a CDN). In AEM this can happen when a request is done "too early" and not all services are up already. I hope that I have fixed all that cases by now for AEM CS, but if you come across such a case, please raise a support ticket and it will likely get routed to me
Anyway, the easiest workaround to handle this situation is to re-activate the affected page again, as this will delete the incorrect cache file.
Hi @Jörg_Hoh ,
To be honest I raised a support ticket 17 days ago, and it has now been assigned to a third Support Engineer. Unfortunately, there has been no progress to date. The second Support Engineer even promised to request that the Engineering team disable the Warmup script. However, since the issue reoccurred a few days ago, I assume the script is still active. We are scheduled for an issue demo call with the SE today and I hope eventually the ticket will move forward.
While I consider the caching issue to be relatively minor, my primary concern is that AEM returns a 200 HTTP status code even when an exception occurs. If AEM were to return the correct 500 status code, the error response would not be cached.
Can you please share the ticket ID in private with me?
Views
Like
Replies
Views
Likes
Replies
Views
Likes
Replies