Hi,
We are getting a 504 Gateway Timeout server error when trying to access the AEM author instance in our stage author environment. In the error logs we see the following "Failed to create checkpoint" warning messages when this happens:
14.10.2020 04:01:01.940 *WARN* [sling-oak-54-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 8a02222a-3c8d-44dc-8066-05e78ea31fad in 10 seconds.
14.10.2020 04:01:01.940 *WARN* [sling-oak-53-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-fulltext-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint a330f028-13fa-4c1f-9430-fb98fdc84cbd in 10 seconds.
14.10.2020 04:01:16.929 *WARN* [sling-oak-57-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint bc3de514-3188-4383-b138-c205a81bac68 in 10 seconds.
14.10.2020 04:01:16.938 *WARN* [sling-oak-52-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-fulltext-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 13cc8004-2cc6-4c7e-8abb-4d579670c510 in 10 seconds.
14.10.2020 04:01:31.932 *WARN* [sling-oak-54-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-fulltext-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint ffd8d722-61ef-49a9-bde5-54d43ebc7663 in 10 seconds.
14.10.2020 04:01:31.941 *WARN* [sling-oak-58-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 2d8be0dc-8fca-4a04-97d2-d0d4646cd87d in 10 seconds.
14.10.2020 04:01:46.924 *WARN* [sling-oak-53-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 54e30c6c-3ee3-401e-a8a3-75949436449b in 10 seconds.
14.10.2020 04:01:46.929 *WARN* [sling-oak-57-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-fulltext-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 1903c0fe-9f4a-47e9-b1d6-5328f9d85750 in 10 seconds.
14.10.2020 04:02:01.940 *WARN* [sling-oak-52-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint ed77ef61-3bac-4d59-b98a-4a98108abb3a in 10 seconds.
The server starts responding again after a while, and these warnings stop being logged once it does. The last time this happened, it lasted nearly 100 minutes.
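To see how long such a burst lasts and which indexing lane is affected, the warnings can be summarized with a short script. This is only a rough sketch: it assumes the log lines have exactly the shape quoted above, and the function name `summarize` and the regular expression are made up for illustration, not part of AEM or Oak.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the WARN lines quoted above. The "lane" is
# whatever follows "AsyncIndexUpdate-" in the thread name (async, fulltext-async).
WARN_RE = re.compile(
    r"^(?P<ts>\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) \*WARN\* "
    r"\[[^\]]*AsyncIndexUpdate-(?P<lane>[^\]]+)\] "
    r".*Failed to create checkpoint (?P<id>[0-9a-f-]+) in (?P<secs>\d+) seconds"
)


def summarize(lines):
    """Return {lane: (count, first_timestamp, last_timestamp)} for matching lines."""
    stats = {}
    for line in lines:
        match = WARN_RE.match(line.strip())
        if match is None:
            continue  # not a "Failed to create checkpoint" warning
        ts = datetime.strptime(match.group("ts"), "%d.%m.%Y %H:%M:%S.%f")
        lane = match.group("lane")
        count, first, last = stats.get(lane, (0, ts, ts))
        stats[lane] = (count + 1, min(first, ts), max(last, ts))
    return stats
```

For example, passing an open error log file to `summarize` and printing `last - first` per lane gives the length of the window in which checkpoints could not be created.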
Any guidance here will be really helpful to understand the issue.
Thanks,
Bhawesh
Hi Bhawesh,
We faced the same issue earlier. Adobe recommended that we add the JVM argument -Doak.segmentNodeStore.commitFairLock=true and restart the cq5 service. It reduced the error rate but did not solve the issue completely.
I know it's an old issue, but were you able to resolve it after the upgrade? We are currently on 6.5.14 and still facing this, and our AEM instance is slowing down.
FYI, I'm seeing this issue in AEM 6.5.13. I'm also seeing heavier load and slow replication queue processing accompanying this.
The JVM argument:
-Doak.segmentNodeStore.commitFairLock=true
doesn't seem to resolve it. I'm planning to compact the repository and rebuild indexes.
I know this is an old thread, but did compaction solve your problem?
Hi @Shashi_Mulugu ,
Tar compaction and index rebuild also did not resolve our issue. The issue did go away for a few hours, though, after we stopped AEM, removed all checkpoints (rm-all), compacted, started AEM, let the index rebuild completely, and finally restarted the corresponding web server so that this publish instance could receive traffic again.
The existing replication queue was processed as the index was rebuilt, but the issue returned the next evening on this publish instance: replication queue processing became very slow (300-400 jobs, all small files/pages to publish, backed up and processing slowly). While the issue is happening we also see very low CPU idle and very high iowait on this publish instance, and the corresponding dispatcher / web server starts to spawn many more httpd processes, I imagine due to the slow handling of the replication requests.
I believe this "Failed to create checkpoint" message was not actually significant to our issue (though it was perhaps a symptom). I think I see the same warning sometimes on our healthy publish instances, and I may have read that it's a temporary file lock issue.
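For anyone who wants to put numbers on the "low CPU idle, high iowait" observation, here is a minimal sketch that compares two samples of the aggregate `cpu` line from `/proc/stat`. It assumes a Linux host. The helper name `cpu_shares` is hypothetical, and tools such as `iostat` or `top` report the same figures.

```python
def cpu_shares(before, after):
    """Return (idle_pct, iowait_pct) between two aggregate 'cpu' lines of /proc/stat."""

    def fields(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
        # Field order: user nice system idle iowait irq softirq steal.
        # The trailing guest fields are already counted in user/nice, so skip them.
        return [int(value) for value in parts[1:9]]

    first, second = fields(before), fields(after)
    delta = [b - a for a, b in zip(first, second)]
    total = sum(delta)
    if total <= 0:
        raise ValueError("samples must be taken at two different times")
    return 100.0 * delta[3] / total, 100.0 * delta[4] / total
```

Reading the first line of `/proc/stat` twice, a few seconds apart, and passing both strings to `cpu_shares` gives the idle and iowait percentages for that interval.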
I will try to update here when I know more.
I am seeing similar symptoms on our Author instance. Were you able to resolve the problem?
Thanks,
I believe our issue was resolved by restarting the machine; perhaps restart all machines involved in case there are any locked connections between them? Good luck!