Hi Experience League Community,
We’ve been observing a recurring issue where our publish instances experience outages (a white page with a 504 error). During these events, telemetry shows a noticeable spike and plateau in startJob transactions beginning 10-15 minutes before the downtime. The transaction count stays consistently high until the publishers go down, at which point it drops to zero.
Inspecting the publisher logs, the only reference to startJob we could find is in the stack trace of an unclosed ResourceResolver warning:
*WARN* [Apache Sling Resource Resolver Finalizer Thread] org.apache.sling.resourceresolver.impl.CommonResourceResolverFactoryImpl Closed unclosed ResourceResolver. The creation stacktrace is available on info log level.
*INFO* [Apache Sling Resource Resolver Finalizer Thread] org.apache.sling.resourceresolver.impl.CommonResourceResolverFactoryImpl Unclosed ResourceResolver was created here:
java.lang.Exception: Opening Stacktrace
at org.apache.sling.resourceresolver.impl.CommonResourceResolverFactoryImpl$ResolverReference.<init>(...)
at org.apache.sling.resourceresolver.impl.CommonResourceResolverFactoryImpl.register(...)
at org.apache.sling.resourceresolver.impl.ResourceResolverImpl.<init>(...)
...
at com.day.cq.dam.usage.impl.listener.AssetUsageListener.process(...)
at org.apache.sling.event.impl.jobs.JobConsumerManager$JobConsumerWrapper.process(...)
at org.apache.sling.event.impl.jobs.queues.JobQueueImpl.startJob(...)
...
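For context, our understanding is that this warning is logged when a ResourceResolver is opened during job processing and never closed, so the Finalizer Thread has to clean it up later. The snippet below is only a generic sketch of that pattern to confirm our understanding of the warning; it is not the actual AssetUsageListener code, and the sub-service name is invented:

// Minimal, hypothetical sketch of the pattern behind "Closed unclosed ResourceResolver".
import java.util.Collections;
import java.util.Map;

import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;

public class ResolverLeakExample {

    // "assetUsageService" is a made-up sub-service name, purely for illustration.
    private static final Map<String, Object> AUTH_INFO =
            Collections.singletonMap(ResourceResolverFactory.SUBSERVICE, "assetUsageService");

    // Leaky variant: if the work throws, or close() is simply forgotten, the
    // resolver stays open and is later reported by the Finalizer Thread.
    void leaky(ResourceResolverFactory factory) throws LoginException {
        ResourceResolver resolver = factory.getServiceResourceResolver(AUTH_INFO);
        resolver.getResource("/content/dam"); // ... do work, resolver.close() is never called
    }

    // Safe variant: ResourceResolver is Closeable, so try-with-resources
    // guarantees it is closed even when the job processing fails.
    void safe(ResourceResolverFactory factory) throws LoginException {
        try (ResourceResolver resolver = factory.getServiceResourceResolver(AUTH_INFO)) {
            resolver.getResource("/content/dam"); // ... do work
        }
    }
}

Since AssetUsageListener is product code, we can't change it ourselves, so we mainly want to know whether these leaked resolvers can realistically accumulate to the point of taking a publisher down.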
We’re trying to understand:
- What could be causing this spike in startJob transactions before the publisher outages?
- Could this spike and plateau in startJob transactions be the cause of our publisher outages?
- Could this be linked to unclosed ResourceResolver instances or a specific job/transaction type?
- What steps can we take to identify the jobs or transactions involved and mitigate the issue? (A rough sketch of what we’re considering is below.)
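On the last question, this is a rough sketch of what we’re considering running on a publisher to see which job queues and topics are actually busy when the plateau starts. It uses the Sling JobManager API; the component and class name are just illustrative, and in practice we’d probably invoke dump() from a temporary servlet or the Groovy console while the spike is happening rather than only on activation:

// Hypothetical diagnostic component: logs per-queue job statistics and the
// topics of currently active jobs via the Sling JobManager API.
import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.JobManager;
import org.apache.sling.event.jobs.Queue;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Component(service = JobQueueDumper.class, immediate = true)
public class JobQueueDumper {

    private static final Logger LOG = LoggerFactory.getLogger(JobQueueDumper.class);

    @Reference
    private JobManager jobManager;

    // Log one snapshot on activation; repeat while the plateau is happening.
    @Activate
    protected void activate() {
        dump();
    }

    public void dump() {
        // Per-queue counts: a queue whose queued count keeps growing while its
        // active count stays pinned is a likely candidate for the plateau.
        for (Queue queue : jobManager.getQueues()) {
            LOG.info("Queue '{}': active={}, queued={}, avgProcessingMs={}",
                    queue.getName(),
                    queue.getStatistics().getNumberOfActiveJobs(),
                    queue.getStatistics().getNumberOfQueuedJobs(),
                    queue.getStatistics().getAverageProcessingTime());
        }

        // Topics of jobs currently being processed (null topic = all topics,
        // the limit of 50 is arbitrary).
        for (Job job : jobManager.findJobs(JobManager.QueryType.ACTIVE, null, 50)) {
            LOG.info("Active job: topic={}, queue={}, started={}",
                    job.getTopic(), job.getQueueName(), job.getProcessingStarted());
        }
    }
}

Does this look like a sensible way to correlate the startJob spike with specific job topics, or is there a better built-in tool (console, MBean, etc.) we should be using instead?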
Any insights or suggestions on how to further investigate this would be greatly appreciated.
Thanks in advance!