Over the past weekend our production author instance became mostly unresponsive. Oddly, we could log in to crx console but not to sites, assets, or system/console. It didn't render pages either.
Restarting the instance brought it back up and it seems okay now, but obviously this makes us a bit nervous. This could have been bad news if it had been a publish instance that got into this state. Potentially relevant logs are posted at the end. I can post more logs if needed.
We are currently implementing JMX monitoring through SolarWinds and would like to set alerts appropriately so that we can catch such issues, but there are more than a thousand MBeans.
What are the best practices for what to monitor from a JMX perspective? We already have pretty good system-level monitoring set up through Dynatrace.
Any recommendations would be much appreciated!
2021-06-07 11:13:10,731 *ERROR* [FelixStartLevel] com.adobe.granite.cors bundle com.adobe.granite.cors:1.0.10.CQ650-B0002 (237)[com.adobe.granite.cors.impl.CORSPolicyImpl(745)] : The activate method has thrown an exception (org.osgi.service.component.ComponentException: Support Credentials is not allowed when Origin is set to Any (*).) org.osgi.service.component.ComponentException: Support Credentials is not allowed when Origin is set to Any (*). at com.adobe.granite.cors.impl.CORSPolicyImpl.activate(CORSPolicyImpl.java:204) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.felix.scr.impl.inject.methods.BaseMethod.invokeMethod(BaseMethod.java:228) at org.apache.felix.scr.impl.inject.methods.BaseMethod.access$500(BaseMethod.java:41) at org.apache.felix.scr.impl.inject.methods.BaseMethod$Resolved.invoke(BaseMethod.java:664) at org.apache.felix.scr.impl.inject.methods.BaseMethod.invoke(BaseMethod.java:510)
2021-06-07 11:13:28,397 *ERROR* [FelixDispatchQueue] org.apache.felix.http.jetty FrameworkEvent ERROR (org.osgi.framework.ServiceException: Service factory returned null. (Component: com.adobe.granite.cors.impl.CORSFilter (744))) org.osgi.framework.ServiceException: Service factory returned null. (Component: com.adobe.granite.cors.impl.CORSFilter (744)) at org.apache.felix.framework.ServiceRegistrationImpl.getFactoryUnchecked(ServiceRegistrationImpl.java:381) at org.apache.felix.framework.ServiceRegistrationImpl.getService(ServiceRegistrationImpl.java:248) at org.apache.felix.framework.ServiceRegistry.getService(ServiceRegistry.java:350) at org.apache.felix.framework.Felix.getService(Felix.java:3954) at org.apache.felix.framework.BundleContextImpl$ServiceObjectsImpl.getService(BundleContextImpl.java:554)
2021-06-07 11:13:52,689 *ERROR* [FelixStartLevel] com.github.mickleroy.aem-sass-compiler bundle com.github.mickleroy.aem-sass-compiler:1.0.3 (617)[com.github.mickleroy.aem.sass.impl.SassCompilerImpl(4123)] : The activate method has thrown an exception (java.lang.UnsatisfiedLinkError: /hab/svc/author/data/tmp/libjsass-11549228781875460003/libjsass.so: libstdc++.so.6: cannot open shared object file: No such file or directory) java.lang.UnsatisfiedLinkError: /hab/svc/author/data/tmp/libjsass-11549228781875460003/libjsass.so: libstdc++.so.6: cannot open shared object file: No such file or directory at java.base/java.lang.ClassLoader$NativeLibrary.load0(Native Method) at java.base/java.lang.ClassLoader$NativeLibrary.load(ClassLoader.java:2430) at java.base/java.lang.ClassLoader$NativeLibrary.loadLibrary(ClassLoader.java:2487) at java.base/java.lang.ClassLoader.loadLibrary0(ClassLoader.java:2684)
2021-06-07 11:13:49,778 *ERROR* [FelixStartLevel] com.adobe.cq.dam.bp.cloudconfig.impl.MediaPortalCloudConfigurationListener exception occured in copying existing replication agents javax.jcr.AccessDeniedException: OakAccess0000: Access denied at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:232) at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:213) at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.newRepositoryException(SessionDelegate.java:669) at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.save(SessionDelegate.java:495)
2021-06-07 11:13:46,184 *ERROR* [FelixStartLevel] com.adobe.acs.acs-aem-commons-bundle bundle com.adobe.acs.acs-aem-commons-bundle:4.8.4 (580)[com.adobe.acs.commons.replication.packages.automatic.impl.ConfigurationUpdateListener(3481)] : The activate method has thrown an exception (java.lang.NullPointerException) java.lang.NullPointerException: null at com.adobe.acs.commons.util.ResourceServiceManager.refreshCache(ResourceServiceManager.java:142) at com.adobe.acs.commons.util.ResourceServiceManager.activate(ResourceServiceManager.java:74) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.felix.scr.impl.inject.methods.BaseMethod.invokeMethod(BaseMethod.java:228) at org.apache.felix.scr.impl.inject.methods.BaseMethod.access$500(BaseMethod.java:41)
2021-06-07 11:13:46,184 *ERROR* [FelixStartLevel] com.adobe.acs.commons.replication.packages.automatic.impl.ConfigurationUpdateListener Exception allocating resource resolver org.apache.sling.api.resource.LoginException: Cannot derive user name for bundle com.adobe.acs.acs-aem-commons-bundle  and sub service automatic-package-replicator at org.apache.sling.resourceresolver.impl.ResourceResolverFactoryImpl.getServiceResourceResolver(ResourceResolverFactoryImpl.java:79) at com.adobe.acs.commons.replication.packages.automatic.impl.ConfigurationUpdateListener.getResourceResolver(ConfigurationUpdateListener.java:97) at com.adobe.acs.commons.replication.packages.automatic.impl.ConfigurationUpdateListener.getResourceResolver(ConfigurationUpdateListener.java:107) at com.adobe.acs.commons.util.ResourceServiceManager.refreshCache(ResourceServiceManager.java:140) at com.adobe.acs.commons.util.ResourceServiceManager.activate(ResourceServiceManager.java:74)
From my perspective, there are two different areas to look at:
What happened to your application and why did it break? I recommend performing a root cause analysis to identify the reason for the outage. Once you have identified the root cause, you can either fix it or, if that's not easily possible, add monitoring probes for the relevant parts of the application stack.
How can you proactively monitor your AEM instance to get an early notification when things start to go wrong?
Performing the root cause analysis
When it comes to production outages, it's always a bit hard to do a proper root cause analysis because you don't want to further affect your service's uptime and performance (e.g. by increasing log levels or attaching debuggers or other analysis tools). So the ideal scenario is that you can reproduce the issue on a lower environment or a production clone. Is this a one-time issue or is it recurring? First touch points for an analysis should be:
Your general, historical operating system level monitoring: CPU and memory usage, storage, network traffic, number of incoming requests, concurrent users, number of error status code responses, etc. Are there any abnormalities? Anything that looks suspicious? Any peaks that haven't been there before?
Your Java Virtual Machine (JVM): heap space, garbage collection, memory usage, etc. How is your application running inside the JVM? Could there be a memory leak? Is garbage collection working as expected? Does the JVM have enough memory? You may want to create and analyze garbage collection logs, heap dumps, and thread dumps.
Your logs from shortly before the outage: Are there any conspicuous messages/errors? Are there any jobs running that could influence the instance's behavior? How does the amount of written logs compare to other days? The number of errors and exceptions?
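To compare the number of errors between days, a small script over error.log is often enough. The following is only a minimal sketch: it assumes the entry format seen in the logs posted above (date, time, level between asterisks), and the class and method names are made up for illustration. Continuation lines of stack traces do not start with a timestamp, so they are skipped.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of AEM: counts *ERROR* entries per day.
class LogErrorCounter {

    // Assumed entry prefix, e.g. "2021-06-07 11:13:10,731 *ERROR* [FelixStartLevel] ..."
    private static final Pattern ENTRY = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2}) \\d{2}:\\d{2}:\\d{2},\\d{3} \\*(\\w+)\\*");

    static Map<String, Integer> countErrorsPerDay(List<String> lines) {
        Map<String, Integer> perDay = new TreeMap<>(); // sorted by date
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            // Only lines that start a new log entry match the pattern.
            if (m.find() && "ERROR".equals(m.group(2))) {
                perDay.merge(m.group(1), 1, Integer::sum);
            }
        }
        return perDay;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LogErrorCounter.java <path-to-error.log>");
            return;
        }
        countErrorsPerDay(Files.readAllLines(Path.of(args[0])))
                .forEach((day, count) -> System.out.println(day + " " + count));
    }
}
```

Running it against the rotated log files of the previous days gives a rough baseline; a day that stands out is a good place to start reading.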
With Dynatrace in place, you should have deep insight into the JVM and the application and probably have a good starting point for your root cause analysis. Looking at the error messages from your logs, I'm not sure if these are related in terms of a cause or if they are a consequence of the outage.
As for the monitoring part of your question, you should first identify the root cause and then add corresponding monitoring probes that notify you about everything that may have led to the outage. There is no one-size-fits-all monitoring concept, as most outages are, in my experience, not caused by AEM product code but by a project's custom application development inside the AEM framework/stack.
However, I can provide a generic list of monitoring points that I usually recommend:
Standard system-level monitoring, such as hardware, filesystem, CPU load, memory usage, network traffic, etc.
JVM monitoring, such as heap space, threads, garbage collection, etc.
AEM-specific monitoring, such as the status of OSGi bundles, components, and services; the status of replication queues, workflows, and job queues; log files with regard to response status codes, number of (concurrent) requests, specific errors/exceptions, number of errors and exceptions, and size of log files; repository size; security checks according to the official Security Checklists.
Application-specific monitoring: checking relevant probes of your project's custom application.
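For the JVM part of this list, the standard platform MXBeans already cover heap, threads, and garbage collection, so they are a reasonable first set to alert on before digging into the AEM-specific MBeans. Below is a rough sketch that reads them in-process; a remote JMX client such as your monitoring tool would see the same values under java.lang:type=Memory, java.lang:type=Threading, and java.lang:type=GarbageCollector,name=*. The class name and the 85% threshold are assumptions for illustration, not recommendations from Adobe.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// Hypothetical probe; names and threshold are illustrative only.
class JvmHealthProbe {

    static final double HEAP_WARN_RATIO = 0.85; // assumed alert threshold

    // Returns used/max, or -1.0 if the JVM reports no defined maximum.
    static double heapUsedRatio(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0;
        }
        return (double) usedBytes / maxBytes;
    }

    static boolean heapAlert(long usedBytes, long maxBytes, double threshold) {
        return heapUsedRatio(usedBytes, maxBytes) >= threshold;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used ratio: %.2f (alert: %b)%n",
                heapUsedRatio(heap.getUsed(), heap.getMax()),
                heapAlert(heap.getUsed(), heap.getMax(), HEAP_WARN_RATIO));

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads(); // null if none
        System.out.println("live threads: " + threads.getThreadCount()
                + ", deadlocked: " + (deadlocked == null ? 0 : deadlocked.length));

        // Collection count and accumulated time are cumulative since JVM start,
        // so an alert would normally look at the change between two samples.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

A single heap reading just before a collection can look alarming on its own, so it is usually more useful to alert on a ratio that stays high across several samples.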