Sling stops retrying failed jobs

Level 1

We have an issue where, after AEM has been running for several hours and retrying failed jobs as expected, Sling stops retrying failed jobs for several hours. It then goes back to normal and resumes retrying the failed jobs until they either succeed or reach the maximum number of retries and are cancelled. The job queue is configured with maximum retries set to 5 and retry delay set to 30 seconds.

While the jobs are not being retried, the Sling -> Jobs page in the OSGi Admin Console shows the queue with "waitCount=X", where X equals the number of failed jobs that Sling has stopped retrying. With a DEBUG logger set up for the org.apache.sling.event package, we see this in the logs during that time:

31.05.2017 15:11:54.898 *DEBUG* [pool-7-thread-4] org.apache.sling.event.impl.jobs.queues.QueueJobCache Ignoring job because java.util.GregorianCalendar[time=1496257914113,areFieldsSet=true,areAllFieldsSet=true,lenient=false,zone=sun.util.calendar.ZoneInfo[id="GMT-04:00",offset=-14400000,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=1,YEAR=2017,MONTH=4,WEEK_OF_YEAR=22,WEEK_OF_MONTH=5,DAY_OF_MONTH=31,DAY_OF_YEAR=151,DAY_OF_WEEK=4,DAY_OF_WEEK_IN_MONTH=5,AM_PM=1,HOUR=3,HOUR_OF_DAY=15,MINUTE=11,SECOND=54,MILLISECOND=113,ZONE_OFFSET=-14400000,DST_OFFSET=0] or false
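As a cross-check on the log line above, the "time" field of the logged GregorianCalendar can be decoded into a readable timestamp. The following is only a minimal sketch: the class and method names are made up for illustration, and the two numbers are copied from the log entry (time=1496257914113, offset=-14400000).

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class StartedTimeDecoder {

    // Converts epoch milliseconds to a timestamp at a fixed UTC offset
    // (the offset is given in milliseconds, as printed in the log).
    static OffsetDateTime decode(long epochMillis, int offsetMillis) {
        ZoneOffset offset = ZoneOffset.ofTotalSeconds(offsetMillis / 1000);
        return Instant.ofEpochMilli(epochMillis).atOffset(offset);
    }

    public static void main(String[] args) {
        // Values taken from the DEBUG log entry
        System.out.println(decode(1496257914113L, -14400000));
        // prints 2017-05-31T15:11:54.113-04:00
    }
}
```

The decoded value is 15:11:54.113 local time, roughly 785 ms before the log entry itself was written at 15:11:54.898.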

From the logs and the Sling source code, it seems Sling is ignoring the jobs because they have their "event.job.started.time" JCR property set.
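To make the suspected behaviour concrete, here is a rough, self-contained sketch of the kind of check that would produce such a log message. It is not the actual Sling code: the class, the method names, and the use of a plain Map for job properties are hypothetical, and reading the trailing "false" in the log as a read-error flag is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class JobSkipSketch {

    // Property name as it appears on the job node in the JCR
    static final String STARTED_TIME = "event.job.started.time";

    // Hypothetical eligibility check: a job is only picked up for (re)processing
    // if it has no started time and (assumption) no read errors.
    static boolean isEligible(Map<String, Object> jobProps, boolean hasReadErrors) {
        return jobProps.get(STARTED_TIME) == null && !hasReadErrors;
    }

    // Mimics the shape of the DEBUG message seen in the log.
    static String describe(Map<String, Object> jobProps, boolean hasReadErrors) {
        if (isEligible(jobProps, hasReadErrors)) {
            return "Job is eligible for retry";
        }
        return "Ignoring job because " + jobProps.get(STARTED_TIME) + " or " + hasReadErrors;
    }

    public static void main(String[] args) {
        Map<String, Object> stuckJob = new HashMap<>();
        stuckJob.put(STARTED_TIME, "2017-05-31T15:11:54.113-04:00");
        System.out.println(describe(stuckJob, false));        // job is skipped
        System.out.println(describe(new HashMap<>(), false)); // job is eligible
    }
}
```

Under this reading, a failed job whose started time is never cleared would keep being skipped, which would match the "waitCount" staying above zero while nothing is retried.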

This is AEM 6.1 SP1 with the following hotfixes installed:

cq-6.1.0-hotfix-11493-1.0.zip

cq-6.1.0-hotfix-10768-1.4.zip

cq-6.1.0-hotfix-8867-1.1.zip

cq-6.1.0-hotfix-9273-1.0.zip

cq-6.1.0-hotfix-6570-1.3.zip

cq-6.1.0-hotfix-6500-1.5.zip

cq-6.1.0-hotfix-6449-1.2.zip

cq-6.1.0-hotfix-6446-1.0.zip

We also tested with AEM 6.1 SP2 + CFP8, and the issue occurs there as well.

4 Replies

Level 10

I am checking with the Support team. It sounds like a bug.

Level 9

Hi Bill,

Configure "Maximum Parallel Jobs" to half the number of available processors and check whether the issue still occurs.

Thanks,

Level 1

We have tried the following, and the issue still occurs with each:

1. Setting the problematic queue's max parallel jobs to half the number of available processors

2. Setting all queues' max parallel jobs to half the number of available processors

3. Setting the problematic queue's max parallel jobs to -1 (to auto-set it to the number of available processors)

4. Setting all queues' max parallel jobs to -1 (to auto-set it to the number of available processors)
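For reference, the values behind the four settings above can be worked out on the AEM host with a few lines of Java. This is just an illustrative sketch: the class and method names are hypothetical, and the -1 handling simply mirrors the behaviour described in the list rather than Sling's actual implementation.

```java
public class ParallelJobsSketch {

    // Half of the given processor count, but never less than 1
    static int halfOf(int processors) {
        return Math.max(1, processors / 2);
    }

    // Resolves a configured "max parallel jobs" value; -1 is treated as
    // "use the number of available processors", as described above.
    static int resolve(int configured, int processors) {
        return configured == -1 ? processors : configured;
    }

    public static void main(String[] args) {
        int processors = Runtime.getRuntime().availableProcessors();
        System.out.println("Available processors: " + processors);
        System.out.println("Half (options 1 and 2): " + halfOf(processors));
        System.out.println("Resolved -1 (options 3 and 4): " + resolve(-1, processors));
    }
}
```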

Level 1

Thank you, smacdonald2008. By any chance, have you heard anything back from the Support team?