
SOLVED

Extremely high JVM Heap Memory Usage


Level 2

Hi,

 

As background, I'm fairly new to AEM and have to look into this platform, which has been around at the company for over 5 years and has changed ownership a couple of times. To me it's a bit of a black box. I have read all the Adobe documentation regarding the architecture of AEM, Hardware Sizing Guidelines, Performance Tuning Tips, etc.

 

What struck me is that our instances (both author and publisher) are insanely large (122 GB RAM), which seems excessive to me considering what the platform is doing. If I go by the sizing guideline, it recommends around 2 GB of JVM heap memory. I found some IaC templates on GitHub that deploy an AEM stack, and their defaults have 4/8 GB as min/max JVM heap.

 

I created a heap dump and loaded it in VisualVM, and it's about 95% bloated with String class instances: over 100 million instances taking up over 30 GB. After calculating references and following them through some referenced hash maps, they have a Lucene PropertyIndex as root. I read up a bit on the Oak repository and Lucene. However, I have a hard time believing that this is normal and that the platform needs to keep all of that stuff in memory. It's all just fragments of pages or even complete HTML pages, millions of them.

 

When I look at the memory usage of the instance over time, it's a rollercoaster. Suddenly it shoots up by 50 GB, then climbs a bit and stays high for weeks, but sometimes it also falls off a cliff to still very high levels before shooting up again.

 

Can anyone with more experience chime in on whether that's reasonable and considered normal for a repository of a certain size? Any pointers on what to look into next?


6 Replies


Community Advisor

2 GB is the minimum requirement for AEM to start. Memory consumption depends on how large your repo is.
However, you can check the below points.

1) Any custom indexes deployed on AEM

2) Is the dispatcher doing the caching for most of the content? A dispatcher cache hit ratio above 95% is considered good.

3) You can check the logs for any exceptions or open AEM sessions.

4) If you think your code performs lots of operations on Strings, try switching to StringBuilder or StringBuffer, since String is immutable (see the sketch below).
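
To illustrate that last point, here is a minimal, generic Java sketch (not code from this thread): repeated += on a String allocates a new String object on every iteration, while a StringBuilder grows one internal buffer in place.

    public class StringConcatDemo {
        public static void main(String[] args) {
            // Anti-pattern: every += creates a brand-new String (plus a hidden StringBuilder),
            // so n iterations allocate on the order of n intermediate String objects.
            String slow = "";
            for (int i = 0; i < 10_000; i++) {
                slow += i;
            }

            // Preferred: a single StringBuilder reuses and grows its internal buffer.
            StringBuilder fast = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                fast.append(i);
            }

            System.out.println(slow.length() + " / " + fast.toString().length());
        }
    }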


Community Advisor

@ryg457 

 

Memory Fluctuations:

Large, sudden spikes in memory usage may indicate specific events or processes triggering increased resource consumption. Investigate the logs during these periods to identify any concurrent activities or scheduled jobs that coincide with the spikes.

Also, you would need to analyze the heap dump when spikes occur. Details here: https://experienceleague.adobe.com/docs/experience-cloud-kcs/kbarticles/KA-17482.html?lang=en

https://docs.mktossl.com/docs/experience-cloud-kcs/kbarticles/KA-17499.html 

 

 

Maintenance Tasks:

There are various maintenance tasks that should execute regularly to maintain the health of the system. Please cross-check that they are configured:

https://helpx.adobe.com/customer-care-office-hours/aem/6x-maintenance-tasks.html

 

Review code:

  • Avoid Object.wait()-style implementations, which leave multiple threads blocked in parallel while they wait on shared resources.
  • Avoid long-running sessions (a minimal sketch follows this list).
  • Fine-tune indexes for queries
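
On the long-running sessions point, a minimal generic sketch (the subservice name and content path are hypothetical, not taken from this thread): a leaked ResourceResolver keeps repository state referenced on the heap, so it should be opened in try-with-resources and closed as soon as the work is done.

    import java.util.Collections;
    import java.util.Map;

    import org.apache.sling.api.resource.LoginException;
    import org.apache.sling.api.resource.Resource;
    import org.apache.sling.api.resource.ResourceResolver;
    import org.apache.sling.api.resource.ResourceResolverFactory;
    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Reference;

    @Component(service = PageTitleReader.class)
    public class PageTitleReader {

        @Reference
        private ResourceResolverFactory resolverFactory;

        public String readTitle() throws LoginException {
            // Hypothetical service-user mapping; replace with your own subservice name.
            Map<String, Object> auth =
                    Collections.singletonMap(ResourceResolverFactory.SUBSERVICE, "demo-reader");

            // ResourceResolver is Closeable, so try-with-resources releases the
            // underlying session even if an exception is thrown.
            try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(auth)) {
                Resource page = resolver.getResource("/content/my-site/en/jcr:content"); // hypothetical path
                return page != null ? page.getValueMap().get("jcr:title", String.class) : null;
            }
        }
    }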

 

Use Dispatcher and CDN cache:

  • Dispatcher: By serving cached content for anonymous or read-only requests, the AEM server experiences reduced load, freeing up resources to handle more complex and personalized requests.

  • CDN: CDNs absorb a significant portion of the incoming traffic, offloading the origin server and reducing the risk of server overload during traffic peaks.

 

 

You will also be able to find many threads on the Adobe community platform on similar issues.


Aanchal Sikka


Level 2

Thanks, regarding the dispatcher and CDN, that is in place and doing its job.

 

My current impression is that the Lucene index is going crazy. I have not fully understood why it is needed in the first place.


Community Advisor

@ryg457 

 

Lucene indexes are needed for queries. Oak does not index content by default; custom indexes must be created when necessary, much like with traditional relational databases. If there is no index for a specific query, many nodes may be traversed. The query may still work, but it is likely slow. And if the result set is large, the query might even fail, in order to avoid too many traversals.
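
As a rough, generic illustration of how a query hits (or misses) an index, a JCR-SQL2 query like the one below needs an Oak index covering its constraint and ordering; otherwise Oak falls back to traversing the subtree (the node type and path here are just example values):

    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;
    import javax.jcr.query.RowIterator;

    public final class QueryExample {

        private QueryExample() { }

        // Counts pages under a path, newest first. Without a matching index,
        // Oak may have to traverse every node in the subtree to answer this.
        public static long countPages(Session session, String path) throws RepositoryException {
            QueryManager qm = session.getWorkspace().getQueryManager();
            String stmt = "SELECT * FROM [cq:Page] AS p "
                    + "WHERE ISDESCENDANTNODE(p, '" + path + "') "
                    + "ORDER BY p.[jcr:created] DESC";
            QueryResult result = qm.createQuery(stmt, Query.JCR_SQL2).execute();

            long count = 0;
            for (RowIterator rows = result.getRows(); rows.hasNext(); rows.nextRow()) {
                count++;
            }
            return count;
        }
    }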

 

Please refer to the following links for understanding and debugging:

https://experienceleague.adobe.com/docs/experience-cloud-kcs/kbarticles/KA-17492.html?lang=en

https://experienceleague.adobe.com/docs/experience-manager-65/content/implementing/deploying/practic...

https://docs.mktossl.com/docs/experience-manager-65/content/implementing/deploying/deploying/queries... 


Aanchal Sikka


Correct answer by
Employee Advisor

122 GB of heap is indeed excessive. I assume that you have no indication or documentation trail of why this number was set that high (tickets or something like that).

 

In AEM on-prem setups I most often set the heap size to somewhere between 8 and 16 GB, with the most prominent input parameters being:

  • the number of asset workflows running in parallel
  • the size and dimensions of the largest assets you are processing.

 

While transcoding/resizing of videos is typically offloaded to something like ImageMagick or FFmpeg, there are still a lot of operations in AEM 6.5 workflows (and of course also in custom code) that create a Java object from an asset binary, and that can indeed consume a lot of memory (as you also need to account for all the operations performed on it).
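
To illustrate that point with a small, hedged sketch (not code from this thread; the DAM path is hypothetical): reading an asset's original rendition fully into a byte array pins the whole binary on the heap, whereas processing the stream chunk by chunk keeps memory usage roughly constant.

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.sling.api.resource.Resource;
    import org.apache.sling.api.resource.ResourceResolver;

    import com.day.cq.dam.api.Asset;

    public final class AssetSizeCheck {

        private AssetSizeCheck() { }

        // Streams the original rendition of an asset and returns its size in bytes,
        // without ever materializing the whole binary as a byte[] on the heap.
        public static long measureOriginal(ResourceResolver resolver, String assetPath) throws IOException {
            Resource resource = resolver.getResource(assetPath); // e.g. "/content/dam/big-video.mp4" (hypothetical)
            Asset asset = resource != null ? resource.adaptTo(Asset.class) : null;
            if (asset == null || asset.getOriginal() == null) {
                return -1;
            }

            long total = 0;
            byte[] buffer = new byte[8192];
            try (InputStream in = asset.getOriginal().getStream()) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    total += read; // process chunk by chunk instead of loading everything at once
                }
            }
            return total;
        }
    }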

 

That means that in most cases even these huge setups would work perfectly with 16 GB of heap, but when (without further notice) such large assets need to be processed, that is no longer sufficient; and for these rare situations the heap has been increased to that ridiculous number.

 

My recommendation:

  1. Try to identify whether you have large assets in your system (gigabytes) and how they got/get processed.
  2. Limit the concurrency of these operations.
  3. See if you can avoid this in-memory processing and offload it to a different process.

 

 

 


Level 2

Thanks,
just to clarify: 122 GB is the system memory of the instances. I'm currently creating a clone, which I can attach a profiler to while it's running. From inspecting the heap dump, I suspect it's somewhat search related. I still don't get why it would keep all of that in memory. The goal is to optimize and right-size it so that the instance can get along with 16 GB RAM, which seems more reasonable to me for a production server.
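
As a quick, generic way to separate the configured JVM heap from the machine's physical RAM (a sketch, not something from this thread), the JMX memory bean reports the effective -Xmx the running instance actually sees:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class HeapReport {
        public static void main(String[] args) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            long mb = 1024L * 1024L;
            // "max" reflects the effective -Xmx, which can be far smaller than the box's RAM.
            System.out.printf("heap used: %d MB, committed: %d MB, max: %d MB%n",
                    heap.getUsed() / mb, heap.getCommitted() / mb, heap.getMax() / mb);
        }
    }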

 

Will update the thread once I find the real culprit.

 

Regards,