We have an issue where, after AEM has been running for several hours and retrying failed jobs as expected, Sling stops retrying failed jobs. After several more hours it returns to normal and resumes retrying the failed jobs until they either succeed or reach the maximum number of retries and are cancelled. The job queue is configured with maximum retries set to 5 and retry delay set to 30 seconds.
While the jobs are not being retried, the Sling -> Jobs page in the OSGi Admin Console shows the queue with "waitCount=X", where X equals the number of failed jobs that Sling has stopped retrying. With a DEBUG logger set up for the org.apache.sling.event package, we see this in the logs during that period:
31.05.2017 15:11:54.898 *DEBUG* [pool-7-thread-4] org.apache.sling.event.impl.jobs.queues.QueueJobCache Ignoring job because java.util.GregorianCalendar[time=1496257914113,areFieldsSet=true,areAllFieldsSet=true,lenient=false,zone=sun.util.calendar.ZoneInfo[id="GMT-04:00",offset=-14400000,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=1,YEAR=2017,MONTH=4,WEEK_OF_YEAR=22,WEEK_OF_MONTH=5,DAY_OF_MONTH=31,DAY_OF_YEAR=151,DAY_OF_WEEK=4,DAY_OF_WEEK_IN_MONTH=5,AM_PM=1,HOUR=3,HOUR_OF_DAY=15,MINUTE=11,SECOND=54,MILLISECOND=113,ZONE_OFFSET=-14400000,DST_OFFSET=0] or false
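The time= field in the dumped GregorianCalendar is an epoch-millisecond value, so it can be decoded and compared with the timestamp of the log line itself. The following is only a sketch: the class name StartedTimeDecoder and the abbreviated log string are made up for illustration, and it assumes the log timestamp is written in the same GMT-04:00 zone that the calendar reports.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartedTimeDecoder {
    // Matches the epoch-millisecond value inside a dumped GregorianCalendar.
    private static final Pattern TIME_FIELD =
            Pattern.compile("GregorianCalendar\\[time=(\\d+)");

    /** Extracts the time= value from a log line and renders it at the given offset. */
    static OffsetDateTime decode(String logLine, ZoneOffset offset) {
        Matcher m = TIME_FIELD.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no GregorianCalendar time= field found");
        }
        return Instant.ofEpochMilli(Long.parseLong(m.group(1))).atOffset(offset);
    }

    public static void main(String[] args) {
        // Abbreviated copy of the DEBUG line above; only the time= field matters here.
        String line = "QueueJobCache Ignoring job because "
                + "java.util.GregorianCalendar[time=1496257914113,areFieldsSet=true";

        // Assumption: the log timestamp uses the same offset as the calendar (GMT-04:00).
        ZoneOffset offset = ZoneOffset.ofHours(-4);
        OffsetDateTime started = decode(line, offset);
        OffsetDateTime logged =
                LocalDateTime.of(2017, 5, 31, 15, 11, 54, 898_000_000).atOffset(offset);

        System.out.println("started time = " + started);
        System.out.println("gap to log line (ms) = "
                + Duration.between(started, logged).toMillis());
    }
}
```

Under that time zone assumption, the decoded value falls within the same second as the log line, which matches the HOUR_OF_DAY, MINUTE, SECOND and MILLISECOND fields printed in the dump.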
From the logs and the Sling source code, it seems Sling is ignoring the jobs because their "event.job.started.time" JCR property is set.
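To make that reading of the log concrete, here is a simplified model of the kind of check we believe is involved. It is not the actual Sling source: StoredJob, selectForRetry and the readErrors flag are hypothetical names, and treating the trailing "false" in the log message as a read-error flag is our assumption. Only the property name "event.job.started.time" comes from what we observed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StartedTimeFilterSketch {
    // Property name as observed on the stuck job nodes.
    static final String STARTED_TIME = "event.job.started.time";

    /** Hypothetical stand-in for a persisted job: an id, its properties, a read-error flag. */
    static final class StoredJob {
        final String id;
        final Map<String, Object> properties = new HashMap<>();
        final boolean readErrors; // assumption: what the trailing "false" in the log refers to

        StoredJob(String id, Object startedTime, boolean readErrors) {
            this.id = id;
            this.readErrors = readErrors;
            if (startedTime != null) {
                properties.put(STARTED_TIME, startedTime);
            }
        }
    }

    /** Returns the ids of jobs that would be picked up; jobs with a started time are skipped. */
    static List<String> selectForRetry(List<StoredJob> stored) {
        List<String> picked = new ArrayList<>();
        for (StoredJob job : stored) {
            Object started = job.properties.get(STARTED_TIME);
            if (started == null && !job.readErrors) {
                picked.add(job.id);
            } else {
                // Mirrors the shape of the DEBUG message seen in the log.
                System.out.println("Ignoring job " + job.id
                        + " because " + started + " or " + job.readErrors);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<StoredJob> stored = Arrays.asList(
                new StoredJob("job-a", null, false),
                new StoredJob("job-b", new Date(1496257914113L), false));
        System.out.println("picked up: " + selectForRetry(stored));
    }
}
```

If the real check behaves like this model, a failed job whose started time is never cleared would sit in the queue without being picked up again, which would be consistent with the waitCount we see.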
This is AEM 6.1 SP1 with the following hotfixes installed:
cq-6.1.0-hotfix-11493-1.0.zip
cq-6.1.0-hotfix-10768-1.4.zip
cq-6.1.0-hotfix-8867-1.1.zip
cq-6.1.0-hotfix-9273-1.0.zip
cq-6.1.0-hotfix-6570-1.3.zip
cq-6.1.0-hotfix-6500-1.5.zip
cq-6.1.0-hotfix-6449-1.2.zip
cq-6.1.0-hotfix-6446-1.0.zip
We also tested with AEM 6.1 SP2 + CFP8, and the issue occurs there as well.
I am checking with the Support team. This sounds like a bug.
Hi Bill,
Configure "Maximum Parallel Jobs" to half of available processors and verify.
Thanks,
We have tried the following and the issue still occurs with each:
1. setting problematic queue max parallel jobs to 1/2 available processors
2. setting all queues max parallel jobs to 1/2 available processors
3. setting problematic queue max parallel jobs to -1 (to auto set to # of available processors)
4. setting all queues max parallel jobs to -1 (to auto set to # of available processors)
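For reference, the values used in the four attempts above can be derived as in the sketch below. The class and method names are hypothetical, and the handling of -1 simply follows the behaviour described in items 3 and 4 rather than anything verified against the Sling code.

```java
public class MaxParallelSketch {
    /** Half of the available processors, but never less than one. */
    static int halfOfProcessors(int availableProcessors) {
        return Math.max(1, availableProcessors / 2);
    }

    /** Resolves a configured value, treating -1 as "use the number of available processors". */
    static int resolve(int configured, int availableProcessors) {
        return configured == -1 ? availableProcessors : configured;
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("available processors      = " + cpus);
        System.out.println("half (attempts 1 and 2)   = " + halfOfProcessors(cpus));
        System.out.println("-1 resolved (3 and 4)     = " + resolve(-1, cpus));
    }
}
```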
Thank you smacdonald2008. By any chance have you heard anything back from the support team?