SOLVED

Investigating the cause of open sessions and bad site performance!

Level 3

Hi everyone,

My AEM production instance is experiencing very high garbage-collection CPU usage and performance issues. I think it might be because of sessions being left open.

I am checking /system/console/jmx on one of the publishers experiencing high garbage collection (over 60% CPU usage for 10+ hours).

I know that by default there are ~85 SessionStatistics MBeans, and right now there are ~119. I think very large objects on the heap are being retained because these sessions stay open; the large objects are related to Search or QueryBuilder (based on a heap dump I checked).

The sessions that are staying open and their start times are directly related to the spikes in garbage collection.

org.apache.jackrabbit.oak  SessionStatistics  "repository-reader-service@session-434675@Oct 4, 2017 2:39:22 PM"
org.apache.jackrabbit.oak  SessionStatistics  "repository-reader-service@session-435491@Oct 4, 2017 2:40:34 PM"
org.apache.jackrabbit.oak  SessionStatistics  "repository-reader-service@session-442302@Oct 4, 2017 2:49:22 PM"
org.apache.jackrabbit.oak  SessionStatistics  "repository-reader-service@session-443169@Oct 4, 2017 2:50:34 PM"
org.apache.jackrabbit.oak  SessionStatistics  "repository-reader-service@session-467825@Oct 4, 2017 3:35:40 PM"
org.apache.jackrabbit.oak  SessionStatistics  "repository-reader-service@session-468804@Oct 4, 2017 3:38:06 PM"
org.apache.jackrabbit.oak  SessionStatistics  "repository-reader-service@session-471911@Oct 4, 2017 3:45:39 PM"
org.apache.jackrabbit.oak  SessionStatistics  "repository-reader-service@session-472840@Oct 4, 2017 3:47:59 PM"

These are the sessions being started that seem to spike my GC CPU usage.

[attached screenshot: 1318577_pastedImage_1.png]

There are about three other sessions associated with each of these repository-reader-service sessions that are staying open: two from admin@session and one from anonymous@session.
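For reference, the same information can be pulled out of JMX programmatically. A rough sketch (it has to run inside the AEM JVM, e.g. from a scripting console, or be adapted to use a remote JMX connector; filtering on the type key is an assumption based on the bean names above):

import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch: list the Oak SessionStatistics MBean names (user @ session id @ start time).
public class OakSessionList {

    public static List<String> openSessionNames() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        List<String> names = new ArrayList<>();
        // Enumerate the whole Oak JMX domain and keep the SessionStatistics beans.
        for (ObjectName name : server.queryNames(new ObjectName("org.apache.jackrabbit.oak:*"), null)) {
            String type = name.getKeyProperty("type");
            if (type != null && type.contains("SessionStatistics")) {
                names.add(name.getKeyProperty("name"));
            }
        }
        return names;
    }
}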

Is there anywhere I can find more information about how these repository-reader-service sessions are being started? What should my next steps be in solving this problem?

Any advice is GREATLY appreciated!

Thank you!

Ian

1 Accepted Solution

Correct answer by
Employee Advisor

Hi,

The JCR queries need memory to execute. When a search takes a long time and more and more queries come in, the memory usage increases with each query.

If your gc.log doesn't show any increase in garbage collection activity and you don't get OutOfMemoryError exceptions, there is no memory issue at all. Focus instead on improving the searches to speed them up.
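As a starting point, a minimal sketch (the statement and paths are only placeholders) that uses Oak's "explain" prefix to check whether a slow search hits an index or falls back to traversing the repository:

import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.RowIterator;

// Sketch: print the query plan for a search; if the plan shows a traversal instead of
// an index, that query is a prime candidate for optimization or a dedicated index.
public class ExplainSearchQuery {

    public static void printPlan(Session session) throws Exception {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(
                "explain SELECT * FROM [cq:Page] AS p "
              + "WHERE ISDESCENDANTNODE(p, '/content/sites/global/en') "
              + "AND CONTAINS(p.*, 'example term')",
                Query.JCR_SQL2);
        RowIterator rows = query.execute().getRows();
        while (rows.hasNext()) {
            System.out.println(rows.nextRow().getValue("plan").getString());
        }
    }
}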

Jörg

13 Replies

Level 10

When you say open sessions - do you mean custom services where Developers have opened sessions without closing them once done?
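The usual pattern looks roughly like the sketch below (the service-user name is made up): a service session is opened but never logged out, so it stays alive and shows up as one of those long-lived SessionStatistics MBeans.

import javax.jcr.Session;
import org.apache.sling.jcr.api.SlingRepository;

// Sketch of the classic leak and its fix; "read-subservice" is a hypothetical mapping.
public class ReadService {

    private final SlingRepository repository;

    public ReadService(SlingRepository repository) {
        this.repository = repository;
    }

    // Leaky variant: the session is never logged out, so it stays open
    // (and keeps whatever it touched reachable) for the life of the instance.
    public void leaky() throws Exception {
        Session session = repository.loginService("read-subservice", null);
        // ... read from the repository ...
    }

    // Correct variant: always log out in finally.
    public void correct() throws Exception {
        Session session = null;
        try {
            session = repository.loginService("read-subservice", null);
            // ... read from the repository ...
        } finally {
            if (session != null && session.isLive()) {
                session.logout();
            }
        }
    }
}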

Level 10

I checked with Support and they replied:

One thing they can try is to take advantage of the newer JDK 1.8 garbage collectors.

The average customer usually has "-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:NewRatio=1" as the garbage-collection parameters in their startup script.

Change those to "-XX:NewRatio=1 -XX:MaxMetaspaceSize=2048m -XX:+UseG1GC".

Employee Advisor

Hi,

First observation: When you have constantly about 60% CPU load on that system, it seems to me that there is more than one CPU core involved. If you have problems with your garbage collection, you typically see full garbage collections, and these are normally executed by one thread (= 1 CPU core). So in the case of constant full garbage collections, only one core would run at 100%. I don't think that this is the case for you (unless your server has only 2 CPU cores, which I doubt).

I would assume that this problem is caused by something else.

My recommended approach would be:

* Check your garbage collection logs (I hope you have them enabled); if you have garbage collection problems, they will manifest themselves there and you will have plenty of entries.

* Create threaddumps and analyze them.

I assume that the thread dumps will give the most information.
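If jstack or kill -3 is not available on the box, a minimal sketch for dumping all stack traces from inside the JVM (nothing AEM-specific):

import java.util.Map;

// Sketch: print a simple thread dump (thread name, state, frames) to stdout.
public class SimpleThreadDump {

    public static void print() {
        Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
            Thread t = entry.getKey();
            System.out.println("\"" + t.getName() + "\" state=" + t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
            System.out.println();
        }
    }
}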

Regarding these "repository-reader-sessions": What version of AEM do you use? Do you have AEM Forms deployed?

kind regards,
Jörg

Level 3

Hi,

Thank you for the reply and for checking in with Support. I will try this and update this thread.

Level 3

Hi Jörg, thank you for your response.

My version is AEM 6.2, and we do not have AEM Forms deployed at the moment. I will check the thread dumps, and I have to check whether GC logs are being recorded (I do not think they are; I haven't seen or looked at them before).
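Until GC logging is enabled, a rough way to sample the collectors from inside the JVM (a generic sketch, nothing AEM-specific; compare two samples taken a few minutes apart):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: print cumulative GC counts and times for each collector.
public class GcSnapshot {

    public static void print() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", totalTimeMs=" + gc.getCollectionTime());
        }
    }
}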

In the thread dump analyzer, this stack trace below was blocking 8 threads:

oak-repository-executor-1

oak-repository-executor-1 - priority:5 - threadId:0x00007fe5a0ff1000 - nativeId:0x3a80 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1855)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2068)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x0000000741632470> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

[attached screenshot: 1318901_pastedImage_0.png]

Do you make anything of this?

Employee Advisor

Hi,

that's not an issue, as this is a thread-pool worker waiting for something to do. It's not impacting your performance and it's not eating up CPU either. If you can provide a complete thread dump, we can have a better look at it.

Jörg

Level 3

Hi,

Employee Advisor

Your system is doing a lot of queries. Look for the string "/content/sites/global/en/search.html" and you will see that the system is executing a JCR query, which can take considerable time and cause load on the system.

$ grep /content/sites/global/en/search.html jstack.14721.190924.299822273 
"10.43.32.115 [1507244970665] GET /content/sites/global/en/search.html HTTP/1.1" #13759 prio=5 os_prio=0 tid=0x00007fe5640e5000 nid=0x53ba runnable [0x00007fe5136a7000]
"10.43.32.115 [1507146486819] GET /content/sites/global/en/search.html HTTP/1.1" #4987 prio=5 os_prio=0 tid=0x00007fe56ccfd800 nid=0x6466 runnable [0x00007fe5242de000]
"10.43.32.115 [1507145885754] GET /content/sites/global/en/search.html HTTP/1.1" #4882 prio=5 os_prio=0 tid=0x00007fe578510000 nid=0x536c runnable [0x00007fe51d8b8000]
"10.43.32.115 [1507145746704] GET /content/sites/global/en/search.html HTTP/1.1" #4879 prio=5 os_prio=0 tid=0x00007fe564046800 nid=0x5369 runnable [0x00007fe51d1af000]
"10.43.32.115 [1507146346446] GET /content/sites/global/en/search.html HTTP/1.1" #4799 prio=5 os_prio=0 tid=0x00007fe5904ae000 nid=0x4932 runnable [0x00007fe5137a4000]
"10.43.32.115 [1507143034783] GET /content/sites/global/en/search.html HTTP/1.1" #4608 prio=5 os_prio=0 tid=0x00007fe5700f4800 nid=0x2f35 runnable [0x00007fe50f8fe000]
"10.43.32.115 [1507142962835] GET /content/sites/global/en/search.html HTTP/1.1" #4597 prio=5 os_prio=0 tid=0x00007fe58c12f800 nid=0x2ef4 runnable [0x00007fe522089000]
"10.43.32.115 [1507142434757] GET /content/sites/global/en/search.html HTTP/1.1" #4366 prio=5 os_prio=0 tid=0x00007fe568250000 nid=0x348 runnable [0x00007fe51c775000]
"10.43.32.115 [1507142362789] GET /content/sites/global/en/search.html HTTP/1.1" #3542 prio=5 os_prio=0 tid=0x00007fe588024800 nid=0x675a runnable [0x00007fe51ce72000]
$
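If that search goes through QueryBuilder, one low-effort mitigation (a sketch; the predicates and path are only examples) is to bound each query so a single request cannot resolve an unbounded result set:

import java.util.HashMap;
import java.util.Map;
import javax.jcr.Session;
import com.day.cq.search.PredicateGroup;
import com.day.cq.search.Query;
import com.day.cq.search.QueryBuilder;
import com.day.cq.search.result.Hit;
import com.day.cq.search.result.SearchResult;

// Sketch: a bounded QueryBuilder search. p.limit caps the hits that get resolved,
// p.guessTotal avoids counting every match just to display a total.
public class BoundedSearch {

    public static void run(QueryBuilder queryBuilder, Session session, String term) throws Exception {
        Map<String, String> predicates = new HashMap<>();
        predicates.put("path", "/content/sites/global/en");
        predicates.put("type", "cq:Page");
        predicates.put("fulltext", term);
        predicates.put("p.limit", "20");
        predicates.put("p.guessTotal", "true");

        Query query = queryBuilder.createQuery(PredicateGroup.create(predicates), session);
        SearchResult result = query.getResult();
        for (Hit hit : result.getHits()) {
            System.out.println(hit.getPath());
        }
    }
}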

Level 3

Hi Jörg,

The search.html component has a very big index on /content/sites/global/en/*; I imagine this needs to be broken up into smaller indices. If these JCR queries are causing lots of load on the system, should the memory usage be constantly at capacity? Wouldn't we be seeing some level of up-and-down trend reflecting these requests?

Do you think any of these JCR queries are persisting on the heap?

Here are some graphs of the publisher's heap:

https://drive.google.com/open?id=0Bwsk1CHN-EX9MEd5SkRQOFQwNGs

Why are the JCR queries persisting in memory for such a long period of time? Wouldn't that be because they are unreclaimable (held by an open session)?

Thank you again for all of your help

Ian

Level 3

Hi smacdonald2008,

I am speaking to Daycare (Adobe support) about adjusting the GC parameters and trying this.

Thank you for reaching out to support!

Ian

Correct answer by
Employee Advisor

Hi,

The JCR queries need memory to execute. When a search takes a long time and more and more queries come in, the memory usage increases with each query.

If your gc.log doesn't show any increase in garbage collection activity and you don't get OutOfMemoryError exceptions, there is no memory issue at all. Focus instead on improving the searches to speed them up.

Jörg

Level 3

OK, I will focus on search speed then.

Thank you for your insight and time

Level 1

GCeasy is a universal GC log analyzer: its advanced machine-learning algorithm saves time and hassle when dealing with cryptic GC logs. You can quickly detect memory leaks, long GC pauses, premature object promotions, and many other performance-impacting problems. It is a powerful way to tune your application's memory and GC settings, and the Key Performance Indicators and GC statistics sections of the report help you tune to perfection.