How to return proper Last-Modified dates to search engines or CQ Dispatcher vs. HTTP Last-Modified vs. file mtime
Setup description:
CQ 5.6.1 with Apache 2.2 and the Dispatcher module (dispatcher-apache2.2-4.1.5) in front of it, running on Linux with an ext3 file system.
Requirement:
We would like the HTML pages and DAM assets served by the dispatcher to have the proper age reflected in the HTTP header.
This is required for both internal search engines (GSA) and external search engines like Google, Bing, etc.
According to the HTTP RFC (http://tools.ietf.org/html/rfc2616#section-14.29), the Last-Modified HTTP header indicates when a document's content was last modified, so this is what we would like to return properly. (Note: not to be confused with the Date header, which is the date and time at which the response was generated.)
Analysis:
CQ stores the last-modified dates in system properties on the jcr:content node: cq:lastModified for pages and jcr:lastModified for DAM assets. These properties reflect the actual last change of the content, and they are, or could be, returned in HTTP Last-Modified headers.
A constraint of the dispatcher is that HTTP headers received from CQ are "thrown away" when a file is cached, and Apache can only use the metadata stored with the file in the file system.
If we manually modify the "last modified" timestamp (mtime) of a file in the dispatcher cache file system, then Apache returns the correct Last-Modified header in HTTP responses for this file. So the intuitive solution would be to ensure that every cached file has its mtime set to the content's last-modified date.
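The manual experiment described above can be reproduced with a short script. This is only a minimal sketch: the function names are hypothetical and not part of CQ or the dispatcher, and it assumes the web server derives Last-Modified directly from the file's mtime, as observed in our setup.

```python
import os
from datetime import datetime, timezone
from email.utils import formatdate


def set_cached_file_mtime(path, last_modified):
    """Set the file's mtime to a timezone-aware datetime, keeping its atime."""
    atime = os.stat(path).st_atime
    os.utime(path, (atime, last_modified.timestamp()))


def last_modified_header(path):
    """Format the file's mtime as an HTTP date (RFC 1123 style, GMT)."""
    return formatdate(os.stat(path).st_mtime, usegmt=True)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python set_mtime.py <cached file> <ISO date with offset>
    cached_file, iso_date = sys.argv[1], sys.argv[2]
    set_cached_file_mtime(cached_file, datetime.fromisoformat(iso_date))
    print("Last-Modified:", last_modified_header(cached_file))
```

Running this against a file in the cache docroot and then requesting the file through Apache should show the same date in the response header, which is exactly the behaviour that conflicts with the .stat mechanism.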
Problem statement:
The problem is that the dispatcher also uses the file's modification (mtime) timestamp for another purpose: to control auto-invalidation flushing with the .stat files.
A file with a more recent mtime timestamp than the .stat file is considered up-to-date and is delivered from the dispatcher cache. A file with an older mtime timestamp is considered stale and is re-fetched from CQ.
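The freshness rule as I understand it can be sketched as follows. This is not the dispatcher's actual implementation: the function name is hypothetical, and the handling of equal timestamps and of a missing .stat file are assumptions on my part.

```python
import os


def is_cache_entry_fresh(cached_file, stat_file):
    """Return True if the cached file may be served without re-fetching.

    Assumptions (not confirmed by the dispatcher documentation):
    - a missing .stat file means nothing has been invalidated yet
    - a file whose mtime equals the .stat mtime is still treated as fresh
    """
    if not os.path.exists(stat_file):
        return True
    return os.stat(cached_file).st_mtime >= os.stat(stat_file).st_mtime
```

The only input to the decision is the pair of mtimes, which is why changing a cached file's mtime for any other purpose interferes with invalidation.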
For this reason the dispatcher MUST set the mtime timestamp of every file to the time it was last fetched from CQ.
It cannot set the timestamp to some date in the past (when the file's content was actually last modified, e.g. the node property in CQ), since that date stays the same as long as the content does not change. After an invalidation the file would therefore remain older than the .stat file and be re-fetched on every request.
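To make the conflict concrete, here is a small simulation that needs no file system. It is only an illustration under the assumptions stated in this post (a file older than the .stat file is re-fetched); the function and parameter names are made up, and timestamps are plain numbers such as epoch seconds.

```python
def count_refetches(content_modified, initial_fetch, stat_touched,
                    request_times, restore_content_mtime):
    """Count how often a cached file is re-fetched from the backend.

    restore_content_mtime=False: mtime is the fetch time (current behaviour).
    restore_content_mtime=True:  mtime is reset to the content's own date.
    """
    mtime = content_modified if restore_content_mtime else initial_fetch
    refetches = 0
    for now in request_times:
        if mtime < stat_touched:  # file is older than .stat, so it is stale
            refetches += 1
            mtime = content_modified if restore_content_mtime else now
    return refetches


# Content last changed at t=100, first cached at t=150, .stat touched at t=200.
requests = [210, 220, 230]
print(count_refetches(100, 150, 200, requests, restore_content_mtime=False))
print(count_refetches(100, 150, 200, requests, restore_content_mtime=True))
```

With the fetch time as mtime the file is re-fetched once after the invalidation; with the content date as mtime it is re-fetched on every one of the three requests, so the cache is effectively disabled for that file.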
Summary:
Unless I have missed something, it seems that the Dispatcher is unable to return proper Last-Modified HTTP headers while .stat files and auto-invalidation are used, and this is by design.
Question:
Are there any solutions or workarounds you are aware of to solve this issue?
Obvious options like switching off caching would have a huge performance impact and are therefore not viable. Similarly, using the "permission based caching" feature with a permission check servlet would only be marginally better and thus also not good enough.