We've had memory leak issues for a while that eventually push our publishers into 100% GC after a week or two. Up until now we've just been ignoring them because it's easier to restart the publisher, but I would like to really understand how to troubleshoot this. We have a heap dump, but what I'm seeing doesn't seem that helpful. All of the classes it references are framework classes, so I'm not sure how to proceed in finding the actual cause of the leak in our code. Below is the main leak suspect, "HttpListener" loaded by "BundleWiringImpl":
There are hundreds of these instances, each with a URL that is called by the end user. The example below is a keepalive call to a static HTML page, so none of our custom code should even be running.
Does anyone have suggestions on how to proceed here? Every time we take a heap dump the problem suspects are from "org.apache.felix.framework.BundleWiringImpl$BundleClassLoaderJava", "com.day.j2ee.servletengine.HttpListener", and "com.day.j2ee.servletengine.ServletHandlerImpl".
We are still on 5.6.1.
Thanks for that. I did read through those earlier, but they seem to end about where I am now. The examples I've found all have obvious leak suspects, such as a custom class, so I'm not sure how to handle leak suspects that are part of CQ's framework classes. It just seems like all HTTP/Servlet calls are causing memory leaks.
It is obvious from the screenshot you have uploaded.
Closetion.123I.html and ic_kal.html seem to be retaining a lot of heap space. It's likely these requests are getting stuck. You might also want to verify the underlying page component.
There are several online documents that you can refer to for analyzing heap dumps. The one below has helped me numerous times in resolving memory leak issues. Hopefully it will serve you the same way.
I'll take a look at that document, thanks.
ic_kal.html is literally an HTML file stored under /content containing "<html><body>OK</body></html>", so there is no template or underlying page component that I'm aware of. That's why I chose it as an example: it shouldn't be doing any processing besides serving the document. However, I'm guessing that if other pages are causing the issue, this one could simply have been caught after we were already at 100% GC.
Hm, that's definitely a possibility because we do connect to multiple other systems. What did you see that led you to that conclusion?
The HTTP thread is not being released, and most of the time that is caused by an external connection. This is also based on experience: we have many backend integrations with other legacy systems, so the line numbers and classes look familiar. If you could send the heap dump, it would be easy to figure out the culprit. The problem is that a heap dump contains all the data, including some of your environment's security information, so it is not a good idea to discuss it on open forums.
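To illustrate how an external connection can pin a request thread: with `java.net.HttpURLConnection`, the connect and read timeouts default to 0, which means "wait forever". Below is a minimal sketch of setting both explicitly. The class name, method name, and backend URL are made up for illustration, and this is not taken from the poster's code; it only shows the kind of thing to look for in integration code.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class BackendCall {

    /**
     * Opens (but does not yet connect) an HTTP connection with explicit timeouts,
     * so a slow or dead backend cannot hold the calling thread indefinitely.
     */
    public static HttpURLConnection openWithTimeouts(String url, int connectMs, int readMs)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(connectMs); // max time to establish the connection
        conn.setReadTimeout(readMs);       // max time to wait for data once connected
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical backend URL; nothing is sent until the connection is actually used.
        HttpURLConnection conn = openWithTimeouts("http://backend.example.invalid/status", 2000, 5000);
        System.out.println(conn.getConnectTimeout() + " / " + conn.getReadTimeout());
    }
}
```

If the integrations use a different HTTP client, the same idea applies, but the timeout settings will have different names.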
I agree, an external connection could be the issue here, since Sling does return HTTP threads to the thread pool. A quicker way to find out is to take 20 thread dumps, 500 ms apart, and then go through the threads that are running or in a waiting state. Hopefully you will find something interesting there. One thing I would like to know, though: what thread pool size is configured on this server?
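A rough sketch of the repeated-thread-dump idea, using the standard `ThreadMXBean` API. Note the assumption: this samples the JVM it runs in, so it would have to run inside the publisher (or be adapted to a remote JMX connection). For an already-running instance, the JDK's `jstack` tool run in a loop is the more usual route. The class and method names here are made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class ThreadDumpSampler {

    /** Takes `count` dumps of the current JVM's threads, `intervalMs` apart. */
    public static List<ThreadInfo[]> sample(int count, long intervalMs) throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        List<ThreadInfo[]> dumps = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // false, false: skip locked monitor/synchronizer details to keep each dump cheap
            dumps.add(bean.dumpAllThreads(false, false));
            if (i < count - 1) {
                Thread.sleep(intervalMs);
            }
        }
        return dumps;
    }

    /** Counts threads per state (RUNNABLE, WAITING, BLOCKED, ...) in one dump. */
    public static Map<Thread.State, Integer> countByState(ThreadInfo[] dump) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : dump) {
            if (info != null) { // entries can be null for threads that just died
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        List<ThreadInfo[]> dumps = sample(20, 500); // 20 dumps, 500 ms apart
        for (int i = 0; i < dumps.size(); i++) {
            System.out.println("dump " + i + ": " + countByState(dumps.get(i)));
        }
    }
}
```

Threads that show the same stack in the same state across most of the 20 samples are the ones worth reading closely.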
We are also facing similar issues.
Every day we see publisher 1 or publisher 2 having issues (the major one being old generation space reaching 100%), and because of this the admin just restarts the instances.
Below is the link for heap dump analysis.
This may be a memory issue, but we don't have any clues in the application. Could someone please check the link and suggest next steps?
For memory-related issues, you should also take heap dumps along with thread dumps. The link that you have provided is for thread dumps, not heap dumps.
Follow the link to see instructions for taking heap dumps. Also, review the memory usage on this screen: http://aem-host:port/system/console/memoryusage
I suggest you open a Daycare ticket and provide heap dumps, thread dumps, AEM logs, etc. for analysis.
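For reference, the per-pool figures (old generation included) can also be read programmatically through the standard `java.lang.management` API. This is a minimal sketch, assuming it runs inside the JVM being inspected; the class name is made up, and the pool names depend on the garbage collector in use (for example "PS Old Gen" or "G1 Old Gen" on HotSpot).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

public class HeapPoolReport {

    /** Returns used bytes per heap memory pool of the current JVM. */
    public static Map<String, Long> heapPoolUsage() {
        Map<String, Long> usage = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache, etc.
            }
            MemoryUsage u = pool.getUsage();
            if (u != null) { // null if the pool is no longer valid
                usage.put(pool.getName(), u.getUsed());
            }
        }
        return usage;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            if (pool.getType() != MemoryType.HEAP || u == null) {
                continue;
            }
            // getMax() returns -1 when the pool has no defined maximum
            String max = u.getMax() < 0 ? "undefined" : String.valueOf(u.getMax());
            System.out.println(pool.getName() + ": used=" + u.getUsed() + " max=" + max);
        }
    }
}
```

An old generation pool whose used value stays close to its max even after full collections is the pattern described above.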