Sling stops retrying failed jobs
We have an issue where, after AEM has been running for several hours and retrying failed jobs as expected, Sling stops retrying failed jobs for several hours. It then goes back to normal and resumes retrying the failed jobs until they either succeed or reach the maximum number of retries and are cancelled. The job queue is configured with maximum retries set to 5 and retry delay set to 30 seconds.
While the jobs are not being retried, the Sling -> Jobs page of the OSGi Admin Console shows the queue with "waitCount=X", where X equals the number of failed jobs that Sling has stopped retrying. With a DEBUG logger set up for the org.apache.sling.event package, we see this in the logs during that time:
31.05.2017 15:11:54.898 *DEBUG* [pool-7-thread-4] org.apache.sling.event.impl.jobs.queues.QueueJobCache Ignoring job because java.util.GregorianCalendar[time=1496257914113,areFieldsSet=true,areAllFieldsSet=true,lenient=false,zone=sun.util.calendar.ZoneInfo[id="GMT-04:00",offset=-14400000,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=1,YEAR=2017,MONTH=4,WEEK_OF_YEAR=22,WEEK_OF_MONTH=5,DAY_OF_MONTH=31,DAY_OF_YEAR=151,DAY_OF_WEEK=4,DAY_OF_WEEK_IN_MONTH=5,AM_PM=1,HOUR=3,HOUR_OF_DAY=15,MINUTE=11,SECOND=54,MILLISECOND=113,ZONE_OFFSET=-14400000,DST_OFFSET=0] or false
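To make the log line easier to read: the time=1496257914113 field in the Calendar dump is epoch milliseconds. Here is a minimal Java sketch for converting it, where the class and method names are made up for illustration and the -4 hour offset is taken from the zone id="GMT-04:00" in the same log line:

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Hypothetical helper, not part of Sling or AEM.
class StartedTimeDecoder {

    // Converts an epoch-millisecond value to a date-time at a fixed UTC offset.
    static OffsetDateTime decode(long epochMillis, int offsetHours) {
        return Instant.ofEpochMilli(epochMillis)
                      .atOffset(ZoneOffset.ofHours(offsetHours));
    }

    public static void main(String[] args) {
        // Value copied from the DEBUG log line above.
        System.out.println(decode(1496257914113L, -4));
        // prints 2017-05-31T15:11:54.113-04:00
    }
}
```

So the timestamp in the message is 15:11:54.113, about 785 ms before the log entry itself (15:11:54.898).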
From the logs and the Sling source code, it seems Sling is ignoring the jobs because they have their "event.job.started.time" JCR property set.
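To illustrate what we think is happening, here is a rough sketch of the check as we understand it. This is our own paraphrase and not the actual Sling code: the class and method names are hypothetical, and we are assuming that the second value in the log message ("or false") is a flag for read errors on the job.

```java
import java.util.Calendar;
import java.util.HashMap;
import java.util.Map;

// Hypothetical paraphrase of the queue's "ignore job" decision; not Sling source.
class JobIgnoreSketch {

    // Property name as seen on the job node in the JCR.
    static final String PROPERTY_STARTED_TIME = "event.job.started.time";

    // A job is skipped when it already carries a started time,
    // or (assumption) when it could not be read cleanly.
    static boolean isIgnored(Map<String, Object> jobProperties, boolean hasReadErrors) {
        Object started = jobProperties.get(PROPERTY_STARTED_TIME);
        return started != null || hasReadErrors;
    }

    public static void main(String[] args) {
        Map<String, Object> stuckJob = new HashMap<>();
        Calendar started = Calendar.getInstance();
        started.setTimeInMillis(1496257914113L); // value from the log line
        stuckJob.put(PROPERTY_STARTED_TIME, started);

        Map<String, Object> freshJob = new HashMap<>();

        System.out.println(isIgnored(stuckJob, false)); // true: started time is set
        System.out.println(isIgnored(freshJob, false)); // false: eligible for retry
    }
}
```

If this reading is right, a failed job whose started time is never cleared would sit in the queue (counted in waitCount) without being picked up again, which matches what we see.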
This is AEM 6.1 SP1 with the following hotfixes installed:
cq-6.1.0-hotfix-11493-1.0.zip
cq-6.1.0-hotfix-10768-1.4.zip
cq-6.1.0-hotfix-8867-1.1.zip
cq-6.1.0-hotfix-9273-1.0.zip
cq-6.1.0-hotfix-6570-1.3.zip
cq-6.1.0-hotfix-6500-1.5.zip
cq-6.1.0-hotfix-6449-1.2.zip
cq-6.1.0-hotfix-6446-1.0.zip
We also tested with AEM 6.1 SP2 + CFP8, and the issue occurs there as well.