We've recently upgraded AEM from 6.5.5 to 6.5.7 (didn't install 6.5.6) - running on Ubuntu 18.04.5, Java 11.
We have been pretty consistent the last 8-9 months after making some performance adjustments in migrating from 6.3 windows to 6.5 Unbuntu. Rarely having to restart. However, since we've upgraded from SP 6.5.5 to 6.5.7 (we bipassed 6.5.6 - which SHOULDN'T matter), we've noticed really poor performance, author instance needing restarts nearly every 2-3 days and publish instances 5-6 days because of becoming unresponsive. No heap dumps or OutOfMemory exceptions.
I'm just wondering if anybody else might know of any obvious reasons this could be happening before I bog myself down in a sea of heapdumps and threaddumps.
@sdouglasmc we also got the same package from Adobe, it resolved our issue.
Notes from adobe:
Checking the thread dumps and further researching internally, It seems you are running into a known issue(CQ-4312194) with SP7 where numerous threads get blocked due to a Timer with the Component Registry (org.apache.felix.scr.impl.ComponentRegistry). This causes the instance to become unresponsive.
Follow the steps below to resolve the issue: - Install the attached hotfix package - This will trigger a restart of a couple of bundles. So, need to wait 3-5 mins - Go to <host>:<port>/system/console and make sure the "org.apache.felix.scr" version is updated to 2.1.20
For those who haven't updated to 6.5.7 yet, Adobe has a new SP7 installation package with Hotfix-CQ-4312194-FELIX-6252 included in it. Makes upgrade faster and more stable. This is not publicly available (yet) but you can probably ask for it in a ticket.
Can you check in the logs if there are any session leaks or if there are any slow unresponsive queries ? Do you see any errors in the logs ? What do you see in health check dashboards ? Any patterns of high memory or CPU consumptions or disk utilizations ?
Can you check threaddumps. I've noticed AEM 6.5.7 hanging on following deadlock on my environments:
java.lang.Thread.State: WAITING (on object monitor)
at firstname.lastname@example.org/jdk.internal.misc.Unsafe.park(Native Method)
- waiting to lock <0x1d24408f> (a java.util.concurrent.CountDownLatch$Sync) owned by "null" tid=0x-1