Content invalidation best practices
Hi there,
We are experiencing significant issues in our AEM infrastructure caused by massive content updates combined with parallel dispatcher cache invalidation. Cache invalidation is driven by flush agents on the publish instances. One of the symptoms is that the publish queues get stuck, probably due to heavy load on the publishers, which have to handle activation and rendering at the same time.
Currently, any content update pushed from author to publisher invalidates the complete dispatcher cache, which, besides the heavy load, also leads to a poor cache hit ratio.
The problem is exacerbated further by the fact that we have multiple publishers and dispatchers in a many-to-many (spaghetti) relationship, which seems to lead to "invalidation storms": content replicated to multiple publishers triggers repeated invalidations on the dispatchers.
It would be helpful if you could provide some pointers on whether such problems are common and whether they are best solved via:
- infra remediation (physical partitioning of the CQ5 infrastructure into website-specific farms)
  - this seems to be quite the opposite of "scale out" and modern cloud architectures
- content model changes
  - using folder-level (partial) invalidation (multiple stat files), which would at least reduce the number of invalidated assets (see the dispatcher config sketch after this list)
  - are there any content modelling guidelines that allow for better management of asset dependencies (i.e. not requiring invalidation of the whole website when a single label changes)?
- custom invalidation/deployment workflows
  - a custom workflow runs on the author which, after publishing to the activation queues, monitors those queues and, once they are empty again, sends invalidation requests to the individual dispatchers. This would remove the race condition (activation vs. cache invalidation) and avoid invalidating the cache multiple times for a single update (see the Java sketch after this list).
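To make the stat-file option more concrete, here is a minimal sketch of the relevant part of a dispatcher farm configuration; the docroot and the level value are illustrative, not taken from our actual setup:

    /cache
      {
      /docroot "/var/www/html"

      # Maintain .stat files down to two levels below the docroot
      # (e.g. one per /content/<site> subtree), so an activation only
      # touches the .stat file of the subtree it belongs to instead of
      # marking the entire docroot as stale.
      /statfileslevel "2"

      /invalidate
        {
        /0000 { /glob "*" /type "deny" }
        /0001 { /glob "*.html" /type "allow" }
        }
      }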
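And to make the custom-workflow option more concrete, here is a rough, self-contained Java sketch of the flush step we have in mind. The dispatcher host names and the content path are made up, and the queue check is only stubbed out; on the author it would poll the replication agents, e.g. via the com.day.cq.replication.AgentManager API:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;

    // Sketch of "flush each dispatcher once, after the publish queues have drained".
    public class DelayedDispatcherFlush {

        // Hypothetical dispatcher flush endpoints, one per dispatcher.
        private static final List<String> DISPATCHERS = List.of(
                "http://dispatcher-01", "http://dispatcher-02");

        // Sends one invalidation request, mimicking what the standard
        // dispatcher flush agent sends.
        static void flush(String dispatcherBaseUrl, String contentPath) throws IOException {
            URL url = new URL(dispatcherBaseUrl + "/dispatcher/invalidate.cache");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("CQ-Action", "Activate");
            conn.setRequestProperty("CQ-Handle", contentPath);
            conn.setDoOutput(true);
            conn.getOutputStream().close(); // empty body
            if (conn.getResponseCode() != 200) {
                throw new IOException("Flush failed on " + dispatcherBaseUrl
                        + " for " + contentPath + ": HTTP " + conn.getResponseCode());
            }
            conn.disconnect();
        }

        // Placeholder: on the author this would check the replication agent queues.
        static boolean publishQueuesEmpty() {
            return true; // stubbed out in this sketch
        }

        public static void main(String[] args) throws Exception {
            String changedPath = "/content/site-a/en"; // illustrative path
            while (!publishQueuesEmpty()) {
                Thread.sleep(5_000); // let activation and rendering finish first
            }
            for (String dispatcher : DISPATCHERS) {
                flush(dispatcher, changedPath); // one flush per dispatcher per update
            }
        }
    }

The point of the sketch is the ordering: invalidation is only sent once the queues have drained, and each dispatcher is flushed exactly once per update regardless of how many publishers received the content.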
Thx in advance,
Nick