SOLVED

How to return proper Last-Modified dates to search engines or CQ Dispatcher vs. HTTP Last-Modified vs. file mtime

Level 2

Setup description:
CQ 5.6.1 with Apache 2.2 and the Dispatcher (dispatcher-apache2.2-4.1.5) in front of it, running on Linux with an ext3 file system.

Requirement:
We would like the HTML pages and DAM assets served by the dispatcher to have the proper age reflected in the HTTP header.
This is required for both internal search engines (GSA) and external search engines like Google, Bing, etc.
According to the HTTP RFC (http://tools.ietf.org/html/rfc2616#section-14.29), the Last-Modified HTTP header indicates when a document's content was last changed, so this is what we would like to return properly. (Note: not to be confused with the Date header, which is the current date at the time of the request.)

Analysis:
CQ stores the last-modified dates in system properties on the jcr:content node: cq:lastModified for pages and jcr:lastModified for DAM assets. These reflect the actual last change of the content, and they are or could be returned in HTTP Last-Modified headers.
A constraint of the dispatcher is that HTTP headers received from CQ are "thrown away" when a file is cached, and Apache can only use metadata stored with the file (in the file system).
If we manually modify the "last modified" timestamp (mtime) of a file in the dispatcher cache file system, Apache returns the correct Last-Modified header in HTTP responses for this file. So the intuitive solution would be to ensure that every cached file's mtime matches the last-modified date of its content.
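The manual experiment can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names `backdate` and `last_modified_header` and the example timestamp are made up for illustration, and `last_modified_header` merely mimics what a web server does when it derives Last-Modified from a file's mtime.

```python
import os
import tempfile
from email.utils import formatdate

def backdate(path, content_modified_epoch):
    """Set the file's atime and mtime to the content's last-modified time."""
    os.utime(path, (content_modified_epoch, content_modified_epoch))

def last_modified_header(path):
    """Format the file's mtime as an HTTP date, as a static file server would."""
    return formatdate(os.stat(path).st_mtime, usegmt=True)

if __name__ == "__main__":
    # Stand-in for a file in the dispatcher cache (hypothetical content).
    with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as f:
        f.write(b"<html></html>")
    # 1388534400 = 2014-01-01 00:00:00 UTC, a hypothetical cq:lastModified value.
    backdate(f.name, 1388534400)
    print(last_modified_header(f.name))  # Wed, 01 Jan 2014 00:00:00 GMT
    os.remove(f.name)
```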

Problem statement:
The problem is that the dispatcher also uses the file's modification timestamp (mtime) for another purpose: to control auto-invalidation flushing with the .stat files.
A file with a more recent mtime timestamp than the .stat file is considered up-to-date and is delivered from the dispatcher cache; a file with an older mtime timestamp is considered stale and is re-fetched from CQ.
For this reason the dispatcher MUST set the mtime timestamp of every file to the time it was last fetched from CQ.
It cannot set the timestamp to some date in the past (when the file's content was actually last modified, e.g. the node property in CQ), since that date stays the same as long as the content does not change, so the file would remain older than the .stat file and be re-fetched on every request.
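To make the conflict concrete, here is a simplified model of the staleness check as described above. It is not the dispatcher's actual code: `is_stale` and `demo` are hypothetical names, and treating equal timestamps as fresh is an assumption.

```python
import os
import tempfile
import time

def is_stale(cached_path, stat_path):
    """Simplified model: a cached file older than the .stat file is stale."""
    return os.stat(cached_path).st_mtime < os.stat(stat_path).st_mtime

def demo():
    with tempfile.TemporaryDirectory() as cache:
        page = os.path.join(cache, "page.html")
        stat = os.path.join(cache, ".stat")
        for p in (page, stat):
            open(p, "w").close()
        now = time.time()
        os.utime(stat, (now - 60, now - 60))   # last invalidation: a minute ago
        os.utime(page, (now, now))             # page fetched after the invalidation
        stale_after_fetch = is_stale(page, stat)
        # Backdate the page to the (older) age of its content.
        os.utime(page, (now - 86400, now - 86400))
        stale_after_backdate = is_stale(page, stat)
        return stale_after_fetch, stale_after_backdate

if __name__ == "__main__":
    print(demo())  # (False, True)
```

Once the mtime is backdated, the file looks stale no matter how recently it was fetched, which is why it would be requested from CQ again and again.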

Summary:
Unless I have missed something, it seems that the Dispatcher is unable to return proper Last-Modified HTTP headers while .stat files and auto-invalidation are used, and this is by design.

Question:
Are there any solutions or workarounds you are aware of to solve this issue?
Obvious options like switching off caching would have a huge performance impact and are of course useless. Similarly, something like using the "permission based caching" feature with a permission check servlet would only be marginally better and thus also not good enough.

1 Accepted Solution

Correct answer by
Employee Advisor

Hi,

You are right: when you use the dispatcher for caching, a request answered from the cache does not carry the headers CQ originally sent along. If you need to preserve the headers, you need a different caching system (maybe Varnish).

Jörg

4 Replies

Level 10

I agree with Jörg. At the same time, there were internal discussions on this; it is not yet implemented in the dispatcher. Please file a Daycare ticket to track this.

Level 2

Already did, ticket ID is: 60842 :-)

On the one hand, there may be a way to implement this in the dispatcher (I also suggested this in the ticket):

#####
Dispatcher support for last-modified

The dispatcher should be changed so that it can handle and store 2 timestamps for every file it caches:
- the last-modified timestamp of the content (i.e. the time when the content data was last changed)
- a last-cached timestamp it uses for purposes of dispatcher cache auto-invalidation

Since most Unix file systems have no created date for a file, at least one of the two timestamp meta data mentioned needs to be stored somehow. Since storing meta data outside of the actual file is cumbersome, please consider the following approach:

- store the 2 timestamps in Unix's mtime and ctime fields. Since ctime will hardly ever change or be used in a cache folder (permissions typically do not change), this field can be used to store the second timestamp
- store the file's content last-modified timestamp in mtime. In other words, the cq:lastModified or jcr:lastModified value that CQ returns in the HTTP Last-Modified header should be written to the file's mtime field by the dispatcher. That way Apache will automatically pick the proper date for the HTTP Last-Modified header it creates for the cached file
- store the HTTP Date header value of the request to CQ in the file's ctime, in other words the time of the last update. When checking whether a file is stale, compare the file's ctime (instead of its mtime) to the timestamps of the .stat files

This approach would not require any extra files for metadata storage and it would not have any impact on existing installations. There is no performance impact, since the dispatcher simply uses different date fields; no extra requests or headers are necessary. There is also no functional impact, since the file permissions in the cache folder typically do not change other than at file creation time.

See also:
http://superuser.com/questions/387042/how-to-check-all-timestamps-of-a-file
http://www.unix.com/tips-and-tutorials/20526-mtime-ctime-atime.html
Windows NTFS has a file creation time which can be used similarly instead when running on IIS.
#####
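The mtime/ctime idea from the ticket text can be tried out on Linux with a small sketch. The function names `cache_file` and `is_stale_by_ctime` are hypothetical, and the staleness rule is the simplified one described earlier in this thread, not the dispatcher's real implementation. One caveat: ctime cannot be set to an arbitrary value from user space. The kernel updates it whenever the inode changes, including when mtime is set, so in practice it ends up as the time of caching rather than the exact HTTP Date value. It also changes on chmod, chown or rename, and on Windows Python's `st_ctime` means creation time instead.

```python
import os
import tempfile
import time

def cache_file(path, body, content_modified_epoch):
    """Write a cached file and backdate its mtime to the content's age.

    Setting mtime is itself an inode change, so the kernel sets ctime
    to the current time as a side effect.
    """
    with open(path, "wb") as f:
        f.write(body)
    os.utime(path, (content_modified_epoch, content_modified_epoch))

def is_stale_by_ctime(cached_path, stat_path):
    """Proposed check: compare the file's ctime (not mtime) to the .stat mtime."""
    return os.stat(cached_path).st_ctime < os.stat(stat_path).st_mtime

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache:
        page = os.path.join(cache, "page.html")
        stat = os.path.join(cache, ".stat")
        open(stat, "w").close()
        now = time.time()
        os.utime(stat, (now - 60, now - 60))        # invalidated a minute ago
        cache_file(page, b"<html></html>", 1388534400)  # hypothetical content date
        print(os.stat(page).st_mtime)               # content date, used for Last-Modified
        print(is_stale_by_ctime(page, stat))        # cached after the invalidation
        os.utime(stat, (now + 60, now + 60))        # simulate a later invalidation
        print(is_stale_by_ctime(page, stat))        # now older than .stat
```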

I admit this is something of a hack though, and more and more I am coming to the conclusion that Varnish is the better dispatcher anyway.

It allows far more flexibility in caching and especially flushing, as it supports full regexes where the dispatcher just has simple patterns (apart from a dozen other advantages).

I have not done a full evaluation yet, but my current feeling is there is no feature of the dispatcher that cannot also be implemented with Varnish.

Having said that, are there any thoughts at Adobe to get rid of the dispatcher altogether and make Varnish the "official" suggested caching layer/plugin? Alternatively, how about open sourcing the dispatcher code or at least making the source available? (there have been some occasions where I wanted to know what exactly goes on under the hood and reverse-engineering is cumbersome)

Level 2

I think we now have a solution for this: the dispatcher now also maintains a corresponding header file (*.h) to store the headers. I know this comment comes late, and this is probably a basic feature everybody is aware of by now; I was not even aware that this used to be an issue. But I still have a pain point with the Last-Modified header/property, and a deep search for it led me here. The issue is how to use this property with multiple publishers: how can we sync the lastModified property among all publishers so that all dispatchers (all requests) return the same Last-Modified header value?