I wonder if you could shed some light on, or suggest ways to resolve, an issue we are experiencing on our AEM platform for our new brochure site. We are currently running AEM 6.1 SP2 and have started to see problems when multiple users capture content on the author concurrently. Initially we had no issues on the instance, but over the last 2 weeks, as usage has grown and more content has been pushed into the author, we have seen multiple intermittent incidents of users hitting timeouts, failures to publish and other random errors on the author. We have about 20 concurrent content authors on the production instance during peak hours, which is when these intermittent issues occur.
We have investigated the symptoms and errors several times and ultimately found that the strain on disk IO seems to be the cause of all the errors. An average page request, login or publish spikes the IO to around 95-105 MB/s (megabytes, not megabits). This seems abnormally high, so much so that disk utilization averages around 100% on the mount that hosts our AEM instance, with IOWAIT averaging 50% or higher, when our content authors are at their busiest. Sometimes just clicking around in 'Assets' on the author pushes the IO to 100 MB/s and over. This would explain why the disk cannot satisfy certain requests, which results in timeouts or 504 errors, with users unable to connect to the site.
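For anyone wanting to reproduce the throughput figures independently of a monitoring tool, here is a minimal Python sketch, assuming a Linux host where `/proc/diskstats` is available. It computes read and write MB/s for one device from two snapshots of that file. The device name `xvdf` in the usage note is a placeholder, not something taken from our setup.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        # fields[2] = device name, fields[5] = sectors read,
        # fields[9] = sectors written
        stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats


def throughput_mb_s(before, after, device, interval_s):
    """Return (read_MB_s, write_MB_s) between two snapshots taken interval_s apart."""
    r0, w0 = parse_diskstats(before)[device]
    r1, w1 = parse_diskstats(after)[device]
    read_mb = (r1 - r0) * SECTOR_BYTES / 1_000_000
    write_mb = (w1 - w0) * SECTOR_BYTES / 1_000_000
    return read_mb / interval_s, write_mb / interval_s
```

To use it, read `/proc/diskstats` twice a few seconds apart and pass both strings with the device name, e.g. `throughput_mb_s(first, second, "xvdf", 5)`. Comparing the read and write halves should at least show whether the load is read-heavy or write-heavy.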
We are hosting this platform in AWS on very robust instances with provisioned IOPS disk volumes. The memory and CPU on this instance are well within their limits, so no issues on those fronts. The author instance is accessed by the users via Cloudflare.
Could you advise whether it is normal for AEM authoring to put this much strain on IO, and whether AEM author instances are designed to handle 20 concurrent users all publishing content, uploading assets, etc. at the same time? Are there any tweaks we can introduce on the author instance to help with this? Any assistance would be much appreciated!
Hi,
Can you share the start parameters you are using for AEM?
20 concurrent users does not seem high at all, but it depends on what they are doing. Are you carrying out your maintenance tasks, such as audit/workflow purge, data store garbage collection (DSGC) and offline compaction, on a regular basis?
Have you determined which requests are taking the longest by analysing the log files using rlog.jar? See "Best Practices for Performance Testing" on docs.adobe.com.
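If rlog.jar is not to hand, the same question can be roughed out with a short script. The Python sketch below is not a replacement for rlog.jar. It assumes the usual `request.log` layout, where a request line looks like `[id] -> METHOD path` and the matching response line looks like `[id] <- status content-type NNNms`. Check that against your own log before trusting the numbers.

```python
import re

# Assumed request.log shapes:
#   09/Mar/2016:10:00:00 +0000 [42] -> GET /some/path HTTP/1.1
#   09/Mar/2016:10:00:01 +0000 [42] <- 200 text/html 1023ms
REQUEST = re.compile(r"\[(\d+)\] -> (\S+) (\S+)")
RESPONSE = re.compile(r"\[(\d+)\] <- (\d{3}) .*?(\d+)ms\s*$")


def slowest_requests(lines, top=10):
    """Pair request and response lines by id; return the slowest first.

    Each result is (duration_ms, method, path, status).
    """
    pending = {}
    finished = []
    for line in lines:
        match = REQUEST.search(line)
        if match:
            pending[match.group(1)] = (match.group(2), match.group(3))
            continue
        match = RESPONSE.search(line)
        if match and match.group(1) in pending:
            method, path = pending.pop(match.group(1))
            finished.append((int(match.group(3)), method, path, int(match.group(2))))
    finished.sort(key=lambda row: row[0], reverse=True)
    return finished[:top]
```

Calling `slowest_requests(open("request.log"))` on a log from a busy period should show whether the slow requests cluster around asset browsing, replication or something else.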
When you say they are uploading assets, how big and how often?
Regards,
Opkar
Hi Opkar, thank you for the reply. Please see the startup parameters below:
java -server -Xmx12288m -XX:MaxPermSize=12288M -Djava.awt.headless=true -Djavax.net.ssl.trustStore=/apps/AEM/crx-quickstart/ssl/cacerts -Dsling.run.modes=author,prod,crx3,crx3tar -jar crx-quickstart/app/cq-quickstart-6.1.0-standalone-quickstart.jar start -c crx-quickstart -i launchpad -p 4502 -Dsling.properties=conf/sling.properties
We are currently testing offline compaction on our staging environment before taking it to production, but have not yet run it on this environment. The environment has been running for about 4 months now. I am also in the process of upgrading Oak Core to 1.2.24, which is already done on staging. Production is currently on 1.2.18 and is running CFP 8.
Revision cleanup (online compaction) is disabled, as it has proven not to really work, and we are running the workflow purge maintenance task.
In addition to Opkar's question, have you taken thread dumps to evaluate what is actually causing the disk I/O?
Jörg