Getting "Failed to create checkpoint" message with AsyncIndexUpdate

Level 2

Hi,

 

We are getting a 504 Gateway Timeout server error when trying to access the AEM author instance in our stage author environment. In the error logs we see the following "Failed to create checkpoint" warning messages when this happens:

 

14.10.2020 04:01:01.940 *WARN* [sling-oak-54-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 8a02222a-3c8d-44dc-8066-05e78ea31fad in 10 seconds.
14.10.2020 04:01:01.940 *WARN* [sling-oak-53-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-fulltext-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint a330f028-13fa-4c1f-9430-fb98fdc84cbd in 10 seconds.
14.10.2020 04:01:16.929 *WARN* [sling-oak-57-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint bc3de514-3188-4383-b138-c205a81bac68 in 10 seconds.
14.10.2020 04:01:16.938 *WARN* [sling-oak-52-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-fulltext-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 13cc8004-2cc6-4c7e-8abb-4d579670c510 in 10 seconds.
14.10.2020 04:01:31.932 *WARN* [sling-oak-54-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-fulltext-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint ffd8d722-61ef-49a9-bde5-54d43ebc7663 in 10 seconds.
14.10.2020 04:01:31.941 *WARN* [sling-oak-58-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 2d8be0dc-8fca-4a04-97d2-d0d4646cd87d in 10 seconds.
14.10.2020 04:01:46.924 *WARN* [sling-oak-53-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 54e30c6c-3ee3-401e-a8a3-75949436449b in 10 seconds.
14.10.2020 04:01:46.929 *WARN* [sling-oak-57-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-fulltext-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 1903c0fe-9f4a-47e9-b1d6-5328f9d85750 in 10 seconds.
14.10.2020 04:02:01.940 *WARN* [sling-oak-52-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint ed77ef61-3bac-4d59-b98a-4a98108abb3a in 10 seconds.

 

The server starts responding again after a while, and these warning messages stop being logged once it does. The last time this happened, it lasted nearly 100 minutes.
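To measure how long such an episode lasts and which indexing lanes are affected, the warnings can be counted from the log file. The following is a minimal sketch in Python, assuming the log lines follow exactly the format of the excerpt above; the function name `summarize` and the `sample` list are illustrative and not part of AEM or Oak.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the warning format shown in the excerpt above (assumed to be stable).
WARNING = re.compile(
    r"^(?P<ts>\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) \*WARN\* "
    r"\[sling-oak-\d+-org\.apache\.jackrabbit\.oak\.plugins\.index\."
    r"AsyncIndexUpdate-(?P<lane>[\w-]+)\] .*"
    r"Failed to create checkpoint (?P<checkpoint>[0-9a-f-]{36}) "
    r"in (?P<seconds>\d+) seconds\."
)


def summarize(lines):
    """Return (warnings per indexing lane, first timestamp, last timestamp)."""
    per_lane = Counter()
    stamps = []
    for line in lines:
        match = WARNING.match(line.strip())
        if match is None:
            continue  # not a "Failed to create checkpoint" warning
        per_lane[match["lane"]] += 1
        stamps.append(datetime.strptime(match["ts"], "%d.%m.%Y %H:%M:%S.%f"))
    if not stamps:
        return per_lane, None, None
    return per_lane, min(stamps), max(stamps)


# Three lines copied from the excerpt above.
sample = [
    "14.10.2020 04:01:01.940 *WARN* [sling-oak-54-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint 8a02222a-3c8d-44dc-8066-05e78ea31fad in 10 seconds.",
    "14.10.2020 04:01:01.940 *WARN* [sling-oak-53-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-fulltext-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint a330f028-13fa-4c1f-9430-fb98fdc84cbd in 10 seconds.",
    "14.10.2020 04:02:01.940 *WARN* [sling-oak-52-org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate-async] org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler Failed to create checkpoint ed77ef61-3bac-4d59-b98a-4a98108abb3a in 10 seconds.",
]

per_lane, first, last = summarize(sample)
print(dict(per_lane), last - first)
```

Running the same function over a whole error log gives the duration of the episode and shows whether both the async and fulltext-async lanes are failing at the same rate.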

 

Any guidance here will be really helpful to understand the issue.

 

Thanks,

Bhawesh

11 Replies

Level 3

Hi Bhawesh,
We faced the same issue earlier. Adobe recommended that we add the JVM init argument -Doak.segmentNodeStore.commitFairLock=true and restart the cq5 service. It reduced the error rate but did not solve the issue completely.

 

Level 1
The error just shows that some other thread was holding the lock at that time, so a checkpoint couldn't be created. Checkpoints should have been created at other times. You can perform routine checks (oak-run check) to confirm that checkpoints are indeed being created and that good revisions are available to which the repository can be reverted in times of crisis. Please add the JVM parameter -Doak.segmentNodeStore.commitFairLock=true and restart AEM; this should help resolve the issue.
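Before restarting, it can be worth confirming that the flag is really part of the JVM options. The sketch below assumes the options are available as a single string; where that string lives (for example a start script variable) depends on the installation and is not stated in this thread. The helper names and the example option string are hypothetical.

```python
import shlex

# The flag recommended in this thread.
FAIR_LOCK_FLAG = "-Doak.segmentNodeStore.commitFairLock=true"


def has_fair_lock(jvm_opts: str) -> bool:
    """Return True if the option string already contains the fair-lock flag."""
    return FAIR_LOCK_FLAG in shlex.split(jvm_opts)


def with_fair_lock(jvm_opts: str) -> str:
    """Return the option string with the flag appended if it was missing."""
    if has_fair_lock(jvm_opts):
        return jvm_opts
    return f"{jvm_opts} {FAIR_LOCK_FLAG}".strip()


# Hypothetical option string, for illustration only.
current_opts = "-server -Xmx4g -Djava.awt.headless=true"
updated_opts = with_fair_lock(current_opts)
print(updated_opts)
```

The flag only takes effect after the instance is restarted with the updated options.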

Level 2
What we noticed is that it happens when the repository is locked for all write operations, so even users can't log in to the system. AMS is pointing to indexing-related issues, and there are a few defects in the Oak version we are using with 6.4.2.0. They asked us to upgrade to at least 6.4.8.2. We are upgrading to 6.5 now anyway. Let's hope that resolves the issue.

Level 2

I know it's an old issue, but were you able to resolve it after the upgrade? We are currently on 6.5.14 and still facing this, and our AEM instance is slowing down.

Level 1
Did you try -Doak.segmentNodeStore.commitFairLock=true? It should have helped.

Level 5

FYI, I'm seeing this issue in AEM 6.5.13. I'm also seeing heavier load and slow replication queue processing accompanying this.

 

The JVM argument:

-Doak.segmentNodeStore.commitFairLock=true 

doesn't seem to resolve it. I'm planning to compact the repository and rebuild indexes.

 

Community Advisor

I know this is an old thread, but did compaction solve your problem?

Hi @Shashi_Mulugu ,

Tar compaction and index rebuild also did not resolve our issue, though the issue did go away for a few hours after we stopped AEM, ran rm-all on the checkpoints, compacted, started AEM, let the index rebuild completely, and finally restarted the corresponding web server to allow this publish instance to receive traffic again.
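The offline part of that sequence (removing checkpoints, then compacting) maps onto two oak-run invocations. The sketch below only builds and prints the commands rather than running them. The jar name and segment store path are assumptions for illustration, AEM must be stopped before either command is run, and removing all checkpoints is what forces the full index rebuild described above.

```python
import shlex


def offline_maintenance_commands(oak_run_jar: str, segmentstore: str):
    """Build the oak-run commands for checkpoint removal and compaction.

    Nothing is executed here; the caller decides when to run them,
    and only while AEM is stopped.
    """
    return [
        ["java", "-jar", oak_run_jar, "checkpoints", segmentstore, "rm-all"],
        ["java", "-jar", oak_run_jar, "compact", segmentstore],
    ]


# Hypothetical jar name and path; adjust to the actual installation.
commands = offline_maintenance_commands(
    "oak-run.jar", "crx-quickstart/repository/segmentstore"
)
for command in commands:
    print(shlex.join(command))
```

The oak-run version should match the Oak version of the instance, and a backup of the repository before any of this is a sensible precaution.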

The existing replication queue was processed as the index was rebuilt, but the issue returned the next evening on this publish instance. That is, replication queue processing became very slow (300-400 jobs, all small files/pages to publish, backed up and processing slowly). When the issue is happening we also see very low CPU idle and very high iowait on this publish instance, and the corresponding dispatcher / web server starts to spawn many more httpd processes, I imagine due to the slow handling of the replication requests.
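For anyone wanting to put numbers on "low CPU idle and high iowait", the two values can be derived from a pair of aggregate `cpu` lines read from `/proc/stat` a few seconds apart. This is a sketch that assumes a Linux host; the function name and the sample counter values are made up for illustration and are not measurements from our instance.

```python
def cpu_shares(before: str, after: str):
    """Return (idle %, iowait %) between two aggregate 'cpu' lines of /proc/stat."""

    def fields(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
        return [int(value) for value in parts[1:]]

    start, end = fields(before), fields(after)
    delta = [y - x for x, y in zip(start, end)]
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted in user/nice, so only the first eight are summed.
    total = sum(delta[:8])
    if total <= 0:
        raise ValueError("the two samples must be taken at different times")
    return 100.0 * delta[3] / total, 100.0 * delta[4] / total


# Made-up samples, for illustration only.
first_sample = "cpu 1000 0 500 8000 500 0 0 0 0 0"
second_sample = "cpu 1100 0 550 8050 1300 0 0 0 0 0"
idle_pct, iowait_pct = cpu_shares(first_sample, second_sample)
print(f"idle {idle_pct:.1f}%  iowait {iowait_pct:.1f}%")
```

On a live host the two input lines would be the first line of `/proc/stat`, read twice with a short pause in between.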

I believe this "Failed to create checkpoint" message was not actually significant to our issue (though it was perhaps a symptom). I think I see this same warning sometimes on our healthy publish instances, and I may have read that it's a temporary file lock issue.

I will try to update here when I know more.

Level 1

I am seeing similar symptoms on our Author instance. Were you able to resolve the problem?


Thanks,

Level 5

I believe our issue was resolved by restarting the machine. Perhaps try restarting all machines involved, in case there are any locked connections between them. Good luck!

Administrator
@singhpal, good to see your contribution in the AEM community. Keep it up.


Kautuk Sahni