Hi everyone,
My AEM production instance is experiencing very high garbage-collection CPU usage and performance issues. I think it might be because of sessions being left open.
I am checking /system/console/jmx on one of the publishers experiencing high garbage collection (over 60% CPU usage for 10+ hours).
I know that by default there are ~85 SessionStatistics MBeans, and right now there are ~119 SessionStatistics MBeans. I think very large objects on the heap are being retained because these sessions stay open; based on a heap dump I checked, the large objects are related to Search or QueryBuilder.
The start times of the sessions that are staying open correlate directly with the spikes in garbage collection.
These are the sessions being started that seem to spike my GC CPU usage.
There are about three other sessions associated with each of these repository-reader-services that are staying open: two from admin@session and one from anonymous@session.
Is there anywhere I can find out more information about repository-reader-services sessions being started? What should my next steps be in solving this problem?
Any advice is GREATLY appreciated!
Thank you!
Ian
When you say open sessions, do you mean custom services where developers have opened sessions without closing them once done?
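For reference, that kind of leak usually looks like a service obtaining a session or ResourceResolver and never releasing it. A minimal sketch of the safe pattern, assuming a standard OSGi service with an injected ResourceResolverFactory (the sub-service name "repository-reader" below is only illustrative, not taken from this instance):

import java.util.Collections;
import java.util.Map;
import javax.jcr.Session;
import org.apache.sling.api.resource.LoginException;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;

// resolverFactory is an injected ResourceResolverFactory (@Reference)
Map<String, Object> authInfo =
        Collections.<String, Object>singletonMap(ResourceResolverFactory.SUBSERVICE, "repository-reader");
ResourceResolver resolver = null;
try {
    resolver = resolverFactory.getServiceResourceResolver(authInfo);
    Session session = resolver.adaptTo(Session.class);
    // ... read from the repository, run queries, etc. ...
} catch (LoginException e) {
    // log and handle the login failure
} finally {
    if (resolver != null && resolver.isLive()) {
        // without this close() the underlying JCR session stays open and keeps
        // showing up as an extra SessionStatistics MBean in /system/console/jmx
        resolver.close();
    }
}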
I checked with Support and they replied:
One thing you can try is to take advantage of the new JDK 1.8 garbage collectors.
The average customer usually has "-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:NewRatio=1" in their startup script for garbage collection;
change those to "-XX:NewRatio=1 -XX:MaxMetaspaceSize=2048m -XX:+UseG1GC".
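For anyone trying this: on a standard quickstart install these flags go into the CQ_JVM_OPTS variable in crx-quickstart/bin/start (adjust for your own start script; the heap placeholder below is illustrative, keep whatever -Xms/-Xmx you already use):

# before: typical parallel-GC settings
CQ_JVM_OPTS="-server -Xmx<heap> -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:NewRatio=1"
# after: G1 settings suggested by support
CQ_JVM_OPTS="-server -Xmx<heap> -XX:NewRatio=1 -XX:MaxMetaspaceSize=2048m -XX:+UseG1GC"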
Hi,
First observation: when you have a constant CPU load of about 60% on that system, it seems to me that more than one CPU core is involved. If you have garbage collection problems, you typically see full garbage collections, and these are normally executed by one thread (= 1 CPU core). So in the case of constant full garbage collections, only one core would run at 100%. I don't think that this is the case for you (unless your server has only 2 CPU cores, which I doubt).
I would assume that this problem is caused by something else.
My recommended approach would be:
* Check your garbage collection logs (I hope you have them enabled); if you have garbage collection problems, they will show up there and you will see plenty of entries.
* Create threaddumps and analyze them.
I assume the threaddumps will give the most information.
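If GC logging is not enabled yet, on JDK 8 you can switch it on with flags like the ones below, and jstack will give you the thread dumps (the paths, file sizes and intervals are only examples):

# JDK 8 GC logging flags (add to the publisher's JVM options)
-Xloggc:/path/to/crx-quickstart/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20m

# take a few thread dumps some seconds apart (replace <pid> with the AEM java process id)
for i in 1 2 3 4 5; do jstack -l <pid> > threaddump.$(date +%s).txt; sleep 10; done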
Regarding these "repository-reader-sessions": What version of AEM do you use? Do you have AEM Forms deployed?
kind regards,
Jörg
Hi,
Thank you for the reply and for checking in with support. I will try this and update this thread.
Hi Jörg, thank you for your response.
My version is AEM 6.2, and we do not have AEM Forms deployed at the moment. I will check the threaddumps, and I still have to see whether GC logs are being recorded (I do not think they are; I haven't looked at them before).
In the thread dump analyzer, the stack trace below was blocking 8 threads:
oak-repository-executor-1 - priority:5 - threadId:0x00007fe5a0ff1000 - nativeId:0x3a80 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1855)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2068)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x0000000741632470> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
Do you make anything of this?
Hi,
That's not an issue; this is a ThreadPool worker waiting for something to do. It's not impacting your performance and it's not eating up CPU either. If you can provide a complete threaddump, we can have a better look at it.
Jörg
Hi,
I posted the threaddump on my Google Drive: threaddump.tar.gz - Google Drive. Let me know if you can access it and what you think.
Thank you again for all of your help.
Your system is doing a lot of queries. Search the dump for the string "/content/sites/global/en/search.html" and you will see that the system is executing JCR queries, which can take considerable time and cause load on the system.
$ grep /content/sites/global/en/search.html jstack.14721.190924.299822273
"10.43.32.115 [1507244970665] GET /content/sites/global/en/search.html HTTP/1.1" #13759 prio=5 os_prio=0 tid=0x00007fe5640e5000 nid=0x53ba runnable [0x00007fe5136a7000]
"10.43.32.115 [1507146486819] GET /content/sites/global/en/search.html HTTP/1.1" #4987 prio=5 os_prio=0 tid=0x00007fe56ccfd800 nid=0x6466 runnable [0x00007fe5242de000]
"10.43.32.115 [1507145885754] GET /content/sites/global/en/search.html HTTP/1.1" #4882 prio=5 os_prio=0 tid=0x00007fe578510000 nid=0x536c runnable [0x00007fe51d8b8000]
"10.43.32.115 [1507145746704] GET /content/sites/global/en/search.html HTTP/1.1" #4879 prio=5 os_prio=0 tid=0x00007fe564046800 nid=0x5369 runnable [0x00007fe51d1af000]
"10.43.32.115 [1507146346446] GET /content/sites/global/en/search.html HTTP/1.1" #4799 prio=5 os_prio=0 tid=0x00007fe5904ae000 nid=0x4932 runnable [0x00007fe5137a4000]
"10.43.32.115 [1507143034783] GET /content/sites/global/en/search.html HTTP/1.1" #4608 prio=5 os_prio=0 tid=0x00007fe5700f4800 nid=0x2f35 runnable [0x00007fe50f8fe000]
"10.43.32.115 [1507142962835] GET /content/sites/global/en/search.html HTTP/1.1" #4597 prio=5 os_prio=0 tid=0x00007fe58c12f800 nid=0x2ef4 runnable [0x00007fe522089000]
"10.43.32.115 [1507142434757] GET /content/sites/global/en/search.html HTTP/1.1" #4366 prio=5 os_prio=0 tid=0x00007fe568250000 nid=0x348 runnable [0x00007fe51c775000]
"10.43.32.115 [1507142362789] GET /content/sites/global/en/search.html HTTP/1.1" #3542 prio=5 os_prio=0 tid=0x00007fe588024800 nid=0x675a runnable [0x00007fe51ce72000]
$
Hi Jörg,
The search.html component queries a very large index on /content/sites/global/en/*. I imagine this needs to be broken up into smaller indices. If these JCR queries are causing a lot of load on the system, is that why the memory usage is constantly at capacity? Wouldn't we be seeing the usage trend up and down to reflect these requests?
Do you think any of these JCR Queries are persisting on the heap?
Here are some graphs of the publisher's heap
Why are the JCR queries persisting in memory for such a long time? Wouldn't that be because they are unreclaimable (held by an open session)?
Thank you again for all of your help
Ian
Hi smacdonald2008,
I am speaking to Daycare (Adobe support) about adjusting the GC parameters and trying this.
Thank you for reaching out to support!
Ian
Hi,
JCR queries need memory to execute. When a search takes a long time and more and more queries come in, the memory usage increases with every query in flight.
If your gc.log doesn't show any increase in garbage collection activity and you don't get OutOfMemory exceptions, there is no memory issue at all. Rather, focus on improving the searches to speed them up.
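If it helps, here is a generic example of what "improving the searches" can mean for a QueryBuilder-based search component (these are standard QueryBuilder predicates, but the values are only illustrative; you can experiment with them in the query debugger at /libs/cq/search/content/querydebug.html):

path=/content/sites/global/en
type=cq:Page
fulltext=someSearchTerm
p.limit=20
p.offset=0
p.guessTotal=true

p.limit/p.offset page the results instead of pulling the full result set on every request, and p.guessTotal avoids counting every hit. Beyond that, check (for example with the Explain Query tool in the operations dashboard, if your version has it) that an Oak index actually covers the path/type/fulltext combination, so the query does not traverse the content tree.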
Jörg
OK, I will focus on search speed then.
Thank you for your insight and time
GCeasy is a universal GC log analyser; its advanced machine-learning analysis saves the time and hassle of dealing with cryptic GC logs. You can quickly detect memory leaks, long GC pauses, premature object promotions and many other performance-impacting problems, and it is a powerful aid for tuning your application's memory and GC settings. The Key Performance Indicators and GC statistics sections of the report help you tune things.