How to return proper Last-Modified dates to search engines or CQ Dispatcher vs. HTTP Last-Modified vs. file mtime

MeasurableBusin

15-10-2015

Setup description:
CQ 5.6.1 with Apache 2.2 and Dispatcher (dispatcher-apache2.2-4.1.5) in front of it, running on Linux with an ext3 file system.

Requirement:
We would like the HTML pages and DAM assets served by the dispatcher to have their proper age reflected in the HTTP headers.
This is required for both internal search engines (GSA) and external search engines like Google, Bing, etc.
According to the HTTP RFC (http://tools.ietf.org/html/rfc2616#section-14.29), the Last-Modified HTTP header indicates when a document's content was last modified, so this is the value we would like to return correctly. (Note: not to be confused with the Date header, which is the current date at the time of the request.)
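To make the difference between the two headers concrete, here is a minimal Python sketch. The helper names http_date and content_age_seconds are hypothetical and only serve as illustration; the sketch formats a timestamp as an HTTP date and derives the content age a crawler could compute from the two headers.

```python
from email.utils import formatdate, parsedate_to_datetime


def http_date(epoch_seconds: float) -> str:
    """Format a Unix timestamp as an HTTP date (always expressed in GMT)."""
    return formatdate(epoch_seconds, usegmt=True)


def content_age_seconds(last_modified: str, date: str) -> float:
    """Age of the content as seen by a client: Date minus Last-Modified."""
    delta = parsedate_to_datetime(date) - parsedate_to_datetime(last_modified)
    return delta.total_seconds()


if __name__ == "__main__":
    # Hypothetical header values, for illustration only.
    print(content_age_seconds("Thu, 15 Oct 2015 10:00:00 GMT",
                              "Thu, 15 Oct 2015 12:00:00 GMT"))
```

If Last-Modified merely reflects the time the file was cached, this computed age is too small and says nothing about the content itself.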

Analysis:
CQ stores the last-modified dates in system properties on the jcr:content node: cq:lastModified for pages and jcr:lastModified for DAM assets. These reflect the actual last change of the content, and they are or could be returned in HTTP Last-Modified headers.
A constraint of the dispatcher is that HTTP headers received from CQ are "thrown away" when a file is cached, and Apache can only use the metadata stored with the file (in the file system).
If we manually modify the "last modified" timestamp (mtime) of a file in the dispatcher cache file system, then Apache returns the correct Last-Modified header in HTTP responses for this file. So the intuitive solution would be to ensure that all cached files have their mtime set accordingly.
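The manual experiment described above can be sketched in Python as follows. The function names are hypothetical, and last_modified_header only mimics how a web server typically derives the header from mtime for a static file; it is not Apache's actual code.

```python
import os
from email.utils import formatdate


def set_cached_mtime(path: str, content_last_modified: float) -> None:
    """Backdate a cached file to the time its content was last changed."""
    atime = os.stat(path).st_atime  # leave the access time untouched
    os.utime(path, (atime, content_last_modified))


def last_modified_header(path: str) -> str:
    """Last-Modified value a static file server would derive from mtime."""
    return formatdate(os.stat(path).st_mtime, usegmt=True)
```

For example, calling set_cached_mtime on a cached page with the epoch value of its cq:lastModified date makes last_modified_header return that date instead of the time the file was written to the cache.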

Problem statement:
The problem is that the dispatcher also uses a file's modification (mtime) timestamp for another purpose: to control auto-invalidation flushing with the .stat files.
A file with a more recent mtime timestamp than the .stat file is considered up to date and is delivered from the dispatcher cache; a file with an older mtime timestamp is considered stale and is fetched from CQ again.
For this reason the dispatcher MUST set the mtime timestamp of every file to the time it was last fetched from CQ.
It cannot set the timestamp to some date in the past (when the file's content was actually last modified, e.g. the node property in CQ), since that date stays the same as long as the content does not change, so after the next invalidation the file would always look stale.
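The conflict can be illustrated with a simplified Python model of the staleness check as described above. This is only a sketch of the behaviour described in this post, not the dispatcher's real implementation; the function name is hypothetical, and treating equal timestamps as fresh is an assumption.

```python
import os


def is_stale(cached_file: str, stat_file: str) -> bool:
    """Simplified model: a cached file is stale if it is older than .stat.

    Assumption: equal timestamps count as fresh.
    """
    return os.stat(cached_file).st_mtime < os.stat(stat_file).st_mtime
```

With this check, a file whose mtime has been backdated to the content's real modification date is reported stale as soon as the .stat file has been touched after that date, so it would be fetched from CQ again on every request.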

Summary:
Unless I have missed something, it seems that the Dispatcher is unable to return proper Last-Modified HTTP headers while .stat files and auto-invalidation are used - and this is by design.

Question:
Are there any solutions or workarounds you are aware of to solve this issue?
Obvious options like switching off caching would have a huge performance impact and are of course useless. Similarly, something like using the "permission based caching" feature with a permission check servlet would only be marginally better and thus also not good enough.

Accepted Solutions (1)

Jörg_Hoh
Employee

15-10-2015

Hi,

You are right: when you use the dispatcher for caching, any request answered from the cache doesn't carry the headers CQ originally sent along. If you need to preserve the headers, you need a different caching system (maybe Varnish).

Jörg

Answers (2)

MeasurableBusin

15-10-2015

Already did, ticket ID is: 60842 🙂

On one hand, there may be a way to implement this in the dispatcher (I also suggested this in the ticket):

#####
Dispatcher support for last-modified

The dispatcher should be changed so that it can handle and store 2 timestamps for every file it caches:
the last-modified timestamp of the content (i.e. the time when the content data was last changed)
a last-cached timestamp that it uses for the purposes of dispatcher cache auto-invalidation

Since most Unix file systems have no created date for a file, at least one of the two timestamp meta data mentioned needs to be stored somehow. Since storing meta data outside of the actual file is cumbersome, please consider the following approach:

store the 2 timestamps in Unix's mtime and ctime fields. Since ctime will hardly ever change or be used in a cache folder (permissions typically do not change), this field can be used to store the second timestamp
store the file content's last-modified timestamp in mtime. In other words, the cq:lastModified or jcr:lastModified that is returned by CQ in the HTTP Last-Modified header should be written to the file's mtime field by the dispatcher. That way Apache will automatically pick the proper date for the HTTP Last-Modified header it creates for the cached file
store the HTTP Date header value of the request to CQ in the file's ctime, in other words the time of the last update. When checking if a file is stale, compare the file's ctime (instead of its mtime) to the timestamps of the .stat files

This approach would not require any extra files for metadata storage, and it would not have any impact on existing installations. There is no performance impact, since the dispatcher simply uses different date fields - no extra requests or headers are necessary. There is also no functional impact, since the file permissions in the cache folder typically do not change other than at file creation time.
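A minimal Python sketch of the proposed two-timestamp scheme, assuming Linux; the function names are hypothetical and this is not dispatcher code. One caveat worth stating: an application cannot set ctime directly. The kernel updates it to the current time whenever the file is written or its mtime is changed, so in practice ctime records the time the file was cached rather than the exact value of the HTTP Date header.

```python
import os


def store_in_cache(path: str, body: bytes, content_last_modified: float) -> None:
    """Write a cached file and backdate its mtime to the content's date.

    The kernel sets ctime to "now" as a side effect of the write and of
    os.utime, which gives the second (last-cached) timestamp.
    """
    with open(path, "wb") as f:
        f.write(body)
    os.utime(path, (content_last_modified, content_last_modified))


def is_stale_by_ctime(cached_file: str, stat_file: str) -> bool:
    """Staleness check that uses ctime instead of mtime (assumed variant)."""
    return os.stat(cached_file).st_ctime < os.stat(stat_file).st_mtime
```

A side effect to keep in mind is that any later chmod, chown or rename of a cached file would also update its ctime and thereby make the file look freshly cached.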

See also:
http://superuser.com/questions/387042/how-to-check-all-timestamps-of-a-file
http://www.unix.com/tips-and-tutorials/20526-mtime-ctime-atime.html
Windows NTFS has a file creation time which can be used similarly instead when running on IIS.
#####

I admit this is something of a hack though, and more and more I am coming to the conclusion that Varnish is the better dispatcher anyway.

It allows far more flexibility in caching and especially flushing, as it supports full regexes where the dispatcher just has simple patterns (apart from a dozen other advantages).

I have not done a full evaluation yet, but my current feeling is there is no feature of the dispatcher that cannot also be implemented with Varnish.

Having said that, are there any thoughts at Adobe to get rid of the dispatcher altogether and make Varnish the "official" suggested caching layer/plugin? Alternatively, how about open sourcing the dispatcher code or at least making the source available? (there have been some occasions where I wanted to know what exactly goes on under the hood and reverse-engineering is cumbersome)

Sham_HC

15-10-2015

I agree with Jörg. At the same time, there were internal discussions on this, but it is not yet implemented in the dispatcher. Please file a Daycare ticket to track this.