In our production environment we have two publish instances. The site is load-balanced by the Dispatcher across both publish instances.
Each server has 16 GB of memory, but most of the time memory consumption is high, and because of that the server hangs.
Mem: 16335344k total, 14835632k used, 1499712k free, 3486092k buffers
Swap: 4194300k total, 111016k used, 4083284k free, 4851876k cached
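Before concluding that memory is exhausted, note that in this (older, procps-style) `top` output the "used" figure includes the kernel's buffers and page cache. The page-cache size is the "cached" value printed on the Swap line, and Linux reclaims that memory on demand. A minimal sketch, assuming exactly the line format shown above, that parses the two lines and estimates how much memory processes actually hold; the function names `parse_fields` and `effective_memory` are hypothetical helpers:

```python
import re
from typing import Dict

# Matches "<number>k <label>" pairs such as "16335344k total" in older procps top output.
_FIELD = re.compile(r"(\d+)k\s+(total|used|free|buffers|cached)")


def parse_fields(line: str) -> Dict[str, int]:
    """Return {label: kilobytes} for each '<n>k <label>' pair on a top summary line."""
    return {label: int(value) for value, label in _FIELD.findall(line)}


def effective_memory(mem_line: str, swap_line: str) -> Dict[str, int]:
    """Split top's 'used' into process memory and reclaimable buffers/cache (all in kB)."""
    mem = parse_fields(mem_line)
    swap = parse_fields(swap_line)
    # In this top format the page-cache size is printed on the Swap line as "cached".
    reclaimable = mem["buffers"] + swap["cached"]
    return {
        "total": mem["total"],
        "used_by_processes": mem["used"] - reclaimable,
        "available": mem["free"] + reclaimable,  # rough estimate, not the kernel's MemAvailable
    }


if __name__ == "__main__":
    mem_line = "Mem: 16335344k total, 14835632k used, 1499712k free, 3486092k buffers"
    swap_line = "Swap: 4194300k total, 111016k used, 4083284k free, 4851876k cached"
    for key, kb in effective_memory(mem_line, swap_line).items():
        print(f"{key:>18}: {kb / 1024 / 1024:.1f} GiB")
    # total 15.6 GiB, used_by_processes 6.2 GiB, available 9.4 GiB
```

For this snapshot, roughly 6.2 GiB is held by processes and about 9.4 GiB is free or reclaimable, assuming the page cache can be dropped. That looks like a busy system, not one that has run out of memory.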
In the logs I also see the following error, which looks problematic:
28.05.2018 01:12:05.738 *ERROR* [Process Executor for diskusage.sh] com.adobe.granite.monitoring.impl.ShellScriptExecutorImpl Error while executing script /data/apps/aem/publish/crx-quickstart/monitoring/diskusage.sh
A similar error is logged for cpu.sh.
Could you please suggest ways to free up memory? Do I need to change any configuration?
Yes, Jorg,
Sometimes it consumes the full 16 GB of memory, and because of that the server goes into a hung state:
Mem: 16335344k total, 16127708k used, 207636k free, 2456296k buffers
Swap: 4194300k total, 152456k used, 4041844k free, 4437592k cached
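Applying the usual reading of this `top` format to the snapshot above: buffers and the page cache ("cached", printed on the Swap line) are counted as "used" but can be reclaimed by the kernel, so, assuming the cache is reclaimable, the memory still available is roughly (all values in kB):

```latex
\begin{aligned}
\text{available} &= \text{free} + \text{buffers} + \text{cached} \\
&= 207636 + 2456296 + 4437592 = 7101524\ \text{kB} \approx 6.8\ \text{GiB} \\
\text{used by processes} &= 16335344 - 7101524 = 9233820\ \text{kB} \approx 8.8\ \text{GiB}
\end{aligned}
```

Even in the "hung" snapshot, most of the gap between "used" and "total" is cache rather than process memory.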
Sure, I will start the structured process to find the root cause, as Kunwar and Mac suggested.
Thank you for the link.
One question about the link:
Tuning Sling Job Queues: "The maximum number of parallel jobs started for this queue." According to the documentation, queue.maxparallel should be set to 50% of the CPU cores.
We have an 8-core CPU, so it should be 4, right? But our environment's configuration has 15.
Do I have to change it to 4? Please advise. Thank you.
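Taking the quoted guideline literally, the arithmetic for these servers is 8 cores × 0.5 = 4. A minimal sketch of that rule of thumb, assuming it means "half the cores, rounded down, but at least 1"; the function `suggested_max_parallel` is hypothetical and not part of the Sling API:

```python
import os
from typing import Optional


def suggested_max_parallel(cores: Optional[int] = None, fraction: float = 0.5) -> int:
    """Rule of thumb quoted above: queue.maxparallel = fraction (default 50%) of CPU cores.

    If `cores` is not given, os.cpu_count() is used; it can return None, in which
    case 1 core is assumed. The result is rounded down and never falls below 1.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, int(cores * fraction))


if __name__ == "__main__":
    print(suggested_max_parallel(8))  # 8-core publish server: prints 4
    print(suggested_max_parallel())   # based on the cores this machine reports
```

This is a sizing guideline, not a diagnosis: lowering the value from 15 would only help if job processing is actually what is saturating the CPU or memory.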
Before you change anything, you should start with a rigorous analysis phase:
* Are you absolutely sure that the AEM instance is the root cause of the problem? Have you ruled out the web servers, networks, load balancers and client systems as candidates for the problem you see? At the moment I don't see a causal relation between "the memory usage is high" (the data you showed are signs of a busy system but do not indicate problems per se) and "the server is hung".
* What does "the server is hung" mean? Can you prove it with logs, for example request.log or garbage collection logs?
* How often does this happen? Does it correlate with other external events?
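One way to turn "the server is hung" into something provable is to scan request.log for minutes in which no request completed, for very slow responses, and for requests that were logged as started but never got a response. A minimal sketch, assuming the request.log line layout shown in the code comments (check it against your own file); the script name, the `analyse` function and the 5-second threshold are illustrative choices, not an AEM tool:

```python
import re
import sys
from collections import Counter
from datetime import datetime, timedelta

# Assumed AEM request.log layout (verify against your own file):
#   28/May/2018:01:12:05 +0200 [1234] -> GET /content/page.html HTTP/1.1
#   28/May/2018:01:12:06 +0200 [1234] <- 200 text/html 850ms
LINE = re.compile(
    r"^(?P<ts>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} \S+ \[(?P<id>\d+)\] "
    r"(?P<dir>->|<-) (?P<rest>.*)$"
)
DURATION = re.compile(r"(\d+)ms$")


def analyse(lines, slow_ms=5000):
    """Return (completed requests per minute, slow responses, requests never answered)."""
    per_minute = Counter()  # key: "dd/Mon/yyyy:HH:MM", in order of first appearance
    slow = []               # (minute, duration in ms, request line)
    open_requests = {}      # request id -> (minute, request line)
    for line in lines:
        m = LINE.match(line.rstrip("\r\n"))
        if not m:
            continue
        if m["dir"] == "->":
            open_requests[m["id"]] = (m["ts"], m["rest"])
            continue
        per_minute[m["ts"]] += 1
        started = open_requests.pop(m["id"], None)
        d = DURATION.search(m["rest"])
        if d and int(d.group(1)) >= slow_ms:
            slow.append((m["ts"], int(d.group(1)), started[1] if started else "?"))
    return per_minute, slow, open_requests


def main(argv):
    if len(argv) < 2:
        print("usage: analyse_requestlog.py <path/to/request.log>")
        return
    with open(argv[1], encoding="utf-8", errors="replace") as fh:
        per_minute, slow, still_open = analyse(fh)
    # Gaps between consecutive minutes with completed requests hint at a stall.
    minutes = [datetime.strptime(ts, "%d/%b/%Y:%H:%M") for ts in per_minute]
    for prev, cur in zip(minutes, minutes[1:]):
        if cur - prev > timedelta(minutes=1):
            print(f"no completed requests between {prev} and {cur}")
    for ts, ms, request in slow:
        print(f"slow: {ts} {ms}ms {request}")
    print(f"{len(still_open)} requests never logged a response")


if __name__ == "__main__":
    main(sys.argv)
```

If the stall windows it reports line up with the memory snapshots or with the diskusage.sh/cpu.sh errors, that is a first piece of evidence; if they don't, the cause may lie elsewhere in the chain.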
Please, before you change anything, make sure that you actually understand the problem.