Monitoring AEM with JMX Exporter and Prometheus + Grafana (AEM Dashboard?) | Community
this-that-the-otter
Level 4
April 4, 2023
Solved

  • 1 reply
  • 4750 views

Is anyone monitoring AEM 6.5 with JMX Exporter and Prometheus + Grafana? We have a rudimentary configuration which only seems to capture JVM metrics about memory use, threads, class loading, etc. which is great - but I would like to gather more details related to specific AEM functionality like replication, etc.

 

I saw the following solution from @wimsymonsvrt here:

- https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/aem-prometheus-jmx-exporter/m-p/260853

but haven't tried doing something similar in my jmx-config.yaml file due to the age of the post.

 

I'm using the following JVM options in my crx-quickstart/bin/start file:

-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.rmi.port=9010 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -javaagent:/opt/prometheus/jmx_prometheus_javaagent-0.17.2.jar=9404:/opt/prometheus/jmx-config.yaml

My jmx-config.yaml file looks like this:

startDelaySeconds: 0
ssl: false
rules:
  - pattern: ".*"

which I think means "get everything": nothing is whitelisted or blacklisted.

 

I'm using the following dashboard in Grafana:

https://grafana.com/grafana/dashboards/14845-jmx-dashboard-basic/

 

If I curl the exporter:

curl 127.0.0.1:9404

I do see a ton of Adobe and Apache related info. I expect Prometheus is storing this data. But the dashboard I'm using doesn't display anything beyond basic info re: memory, threads, class loading, etc.
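One way to check which AEM-specific metric families the exporter actually exposes, independent of any dashboard, is to group the metric names in its output by prefix. The following is a minimal Python sketch, not a definitive tool: the helper names `metric_names` and `count_by_prefix` are hypothetical, and the sample text is illustrative (the replication metric name matches the exporter output quoted in this thread; the JVM line and its value are made up for the example).

```python
from collections import Counter

def metric_names(exposition_text):
    """Return the metric name of every sample line in Prometheus text format."""
    names = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (labels) or the first space (value).
        end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                end = min(end, idx)
        names.append(line[:end])
    return names

def count_by_prefix(names, depth=3):
    """Count samples per name prefix, e.g. 'com_adobe_granite'."""
    return Counter("_".join(name.split("_")[:depth]) for name in names)

# Illustrative sample; in practice this would be the saved curl output.
SAMPLE = """\
# TYPE jvm_threads_current gauge
jvm_threads_current 312.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\\"static\\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\\"flush\\"",} 0.0
"""

counts = count_by_prefix(metric_names(SAMPLE))
for prefix, count in counts.most_common():
    print(prefix, count)
```

To run it against a real instance, save the output of the curl command above to a file and pass the file's contents instead of `SAMPLE`. Any prefix listed here is available to query in Prometheus, whether or not the current dashboard has a panel for it.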

 

Maybe I need a better AEM specific dashboard?

 

Thanks for any info and help!

wimsymonsvrt
Level 4
April 4, 2023

Our jmx-config.yaml and jmx-exporter version (0.14) is still the same.

whitelistObjectNames: [
  "com.adobe.granite.replication:type=agent,*",
  "com.adobe.granite.requests.logging:type=Metrics,name=granite.request.metrics.timer",
  "com.adobe.granite:type=Repository",
  "org.apache.jackrabbit.oak:type=IndexStats,*",
  "org.apache.jackrabbit.oak:type=Metrics,name=SESSION_COUNT",
  "org.apache.jackrabbit.oak:type=SegmentRevisionGarbageCollection,*",
  "org.apache.jackrabbit.oak:type=\"Standby\",*",
  "org.apache.sling.healthcheck:type=HealthCheck,name=MaintenanceTaskRevisionCleanupTask",
  "org.apache.sling.installer:type=Installer,name=Sling OSGi Installer",
  "org.apache.sling:type=queues,*",
]

This still works fine on AEM 6.5.15.0.

 

This provides us Grafana dashboards with Active and Queued Sling Jobs, JCR Session Count, Replication Queue Length, Cold Standby Lag, Async Indexing, Last GC duration, JVM Heap Space Usage, Segment Store Size.

 

But you do need to develop your Prometheus queries based on those metrics. That's not so hard.

 

I want to give you a final pointer: don't export all JMX Beans. It will slow down your AEM instance and it can create memory leaks as Adobe's JMX Beans are not built to be queried every x seconds.

this-that-the-otter
Level 4
April 4, 2023

Thanks for the info. I'll give your config a go.

Do you know what controls the polling interval and how would I adjust that?

Maybe it's scrape interval in my grafana-agent.yaml config file (below)?  

 

server:
  log_level: warn
metrics:
  global:
    scrape_interval: 30s
    remote_write:
      - url: 'https://xxx/api/v1/remote_write'
        sigv4:
          region: 'xxx'
        queue_config:
          max_samples_per_send: 1000
          max_shards: 200
          capacity: 2500
  wal_directory: '/var/lib/grafana-agent'
  configs:
    - name: AEM_JMX
      scrape_configs:
        - job_name: AEM_JMX
          static_configs:
            - targets: ['xxx:9404']
integrations:
  agent:
    enabled: true
  node_exporter:
    enabled: true
    include_exporter_metrics: true
    enable_collectors:
      # - "ntp"
      - "systemd"
    disable_collectors:
      - "mdadm"

 

Other than whitelist, can you post your jmx-config.yaml config file? I'm not sure what else might be good or bad to have there.

When you say:

 

"But you do need to develop your Prometheus queries based on those metrics. That's not so hard."


Is this done via a dashboard file? I'm using one downloaded from here:

https://grafana.com/grafana/dashboards/14845-jmx-dashboard-basic/

If you could post an example (or some info) of how to display/query an AEM-specific metric, I would be grateful.

I'm also seeing the following in our error log (with our current export-everything ".*" config):

 

04.04.2023 14:53:57.436 *WARN* [prometheus-http-1-1] com.day.crx.sling.server.impl.jmx.SecureContentRepositoryAccess Denied reference from bundle 'org.apache.aries.jmx.core'.
04.04.2023 14:53:58.001 *INFO* [HealthCheck Synchronized Clocks] org.apache.sling.discovery.oak.SynchronizedClocksHealthCheck execute: no topology connectors connected to local instance.
04.04.2023 14:53:58.164 *WARN* [prometheus-http-1-1] org.apache.jackrabbit.vault.packaging.impl.PackageManagerMBeanImpl Unable to provide package list. Repository not bound.

 

I tried the following jmx-config.yaml configs:

 

startDelaySeconds: 0
ssl: false
whitelistObjectNames: [
  "com.adobe.granite.replication:type=agent,*",
  "com.adobe.granite.requests.logging:type=Metrics,name=granite.request.metrics.timer",
  "com.adobe.granite:type=Repository",
  "org.apache.jackrabbit.oak:type=IndexStats,*",
  "org.apache.jackrabbit.oak:type=Metrics,name=SESSION_COUNT",
  "org.apache.jackrabbit.oak:type=SegmentRevisionGarbageCollection,*",
  "org.apache.jackrabbit.oak:type=\"Standby\",*",
  "org.apache.sling.healthcheck:type=HealthCheck,name=MaintenanceTaskRevisionCleanupTask",
  "org.apache.sling.installer:type=Installer,name=Sling OSGi Installer",
  "org.apache.sling:type=queues,*",
]

startDelaySeconds: 0
ssl: false
whitelistObjectNames: [
  "com.adobe.granite.replication:type=agent,*",
  "com.adobe.granite.requests.logging:type=Metrics,name=granite.request.metrics.timer",
  "com.adobe.granite:type=Repository",
  "org.apache.jackrabbit.oak:type=IndexStats,*",
  "org.apache.jackrabbit.oak:type=Metrics,name=SESSION_COUNT",
  "org.apache.jackrabbit.oak:type=SegmentRevisionGarbageCollection,*",
  "org.apache.jackrabbit.oak:type=\"Standby\",*",
  "org.apache.sling.healthcheck:type=HealthCheck,name=MaintenanceTaskRevisionCleanupTask",
  "org.apache.sling.installer:type=Installer,name=Sling OSGi Installer",
  "org.apache.sling:type=queues,*",
]
rules:
  - pattern: ".*"

 

With these, I was actually missing data (system metrics such as physical memory) that had been there with the original jmx-config.yaml:

 

startDelaySeconds: 0
ssl: false
rules:
  - pattern: ".*"

 

Also, with your whitelisted config, AEM-related metrics didn't show up in my dashboard, but maybe I need to do some specific queries.

Thanks again!

wimsymonsvrt (Accepted solution)
Level 4
April 5, 2023

What I posted was the entire jmx-config.yaml file. There is nothing more to it.

 

Scraping interval is indeed controlled by Prometheus. I'm not an expert in that area, as our operations team has set up Prometheus company-wide. Please read the Prometheus documentation on how to control that.

 

And you do need to create new panels with PromQL queries in your Grafana dashboard to visualize the extra AEM metrics the JMX exporter provides.

 

A snippet of the data provided by the JMX exporter:

# HELP com_adobe_granite_replication_agent_QueueNumEntries Returns the number of entries in the replication queue. (com.adobe.granite.replication<type=agent, id="static"><>QueueNumEntries)
# TYPE com_adobe_granite_replication_agent_QueueNumEntries untyped
com_adobe_granite_replication_agent_QueueNumEntries{id="\"static\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"youtube\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"flush\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"dynamic_media_replication\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"screens\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"test_and_target\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"scene7\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"publish_reverse\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"s7delivery\"",} 0.0
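One detail in this output is easy to miss: the `id` label values contain literal double quotes (the exposition text shows them escaped as `\"static\"`), so a PromQL selector for a single agent has to include those quotes too. The Python sketch below parses one sample line and rebuilds a matching selector. It is a simplified illustration, not a full parser for the exposition format, and the function names `parse_sample` and `promql_selector` are hypothetical.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def unescape(raw):
    """Undo exposition-format escaping: \\\\ -> \\, \\" -> ", \\n -> newline."""
    return re.sub(
        r'\\(.)',
        lambda m: "\n" if m.group(1) == "n" else m.group(1),
        raw,
    )

def parse_sample(line):
    """Split one sample line into (metric name, label dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a sample line: " + line)
    labels = {
        key: unescape(raw)
        for key, raw in LABEL_RE.findall(match.group("labels") or "")
    }
    return match.group("name"), labels, float(match.group("value"))

def promql_selector(name, labels):
    """Build a PromQL instant selector, escaping backslashes and quotes."""
    parts = []
    for key, value in sorted(labels.items()):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(key + '="' + escaped + '"')
    return name + "{" + ",".join(parts) + "}"

line = r'com_adobe_granite_replication_agent_QueueNumEntries{id="\"static\"",} 0.0'
name, labels, value = parse_sample(line)
print(labels["id"])                   # the value includes the quote characters
print(promql_selector(name, labels))  # selector text usable in a Grafana query
```

The printed selector keeps the escaped quotes inside the label matcher, which is the form to use when a panel should show only one replication agent.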

 

Next, set up your Grafana dashboard; see https://grafana.com/docs/grafana/latest/getting-started/build-first-dashboard/ for how to do that. You can also find several videos on YouTube on the same topic (search for "grafana tutorial").
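For anyone who prefers editing the dashboard JSON over clicking through the UI, a panel is just a small JSON object. The Python sketch below builds a minimal time series panel for the replication queue metric from the snippet above. Treat it as an assumption-laden starting point: the helper name `timeseries_panel`, the panel title, and the grid position are all made up for illustration, and the panel omits a datasource, which assumes the dashboard's default datasource is the Prometheus one.

```python
import json

# Metric name taken from the exporter snippet in this thread.
METRIC = "com_adobe_granite_replication_agent_QueueNumEntries"

def timeseries_panel(title, expr, legend="{{id}}", panel_id=1):
    """Build a minimal Grafana time series panel as a plain dict."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        # Position and size on the dashboard grid (illustrative values).
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
            # legendFormat {{id}} labels each series with its agent id.
            {"refId": "A", "expr": expr, "legendFormat": legend}
        ],
    }

panel = timeseries_panel("Replication queue length", METRIC)
panel_json = json.dumps(panel, indent=2)
print(panel_json)
```

The printed object can be added to the `panels` list of a dashboard's JSON model. Building the same panel in the Grafana UI and then inspecting its JSON is a good way to verify the fields against your Grafana version.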

 

An example query (related to the same snippet of data, in this case the number of items on the replication queue) is:

max_over_time(com_adobe_granite_replication_agent_QueueNumEntries[5m])

This gives you the maximum number of items on each replication queue over a 5 minute period. If your replication queues are blocked, you will see the graph go up at that point in time. This fits best in a time series visualization.
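To make the behaviour of `max_over_time` concrete, here is a small Python sketch that imitates it on a handful of samples. It assumes a 30-second scrape interval (as in the grafana-agent.yaml earlier in this thread) and invented queue lengths; window boundary handling is simplified and may differ slightly from what a given Prometheus version does.

```python
def max_over_time(samples, eval_time, window):
    """Return the largest value among samples in (eval_time - window, eval_time].

    samples is a list of (timestamp_seconds, value) pairs. Returns None when
    the window holds no samples, which corresponds to a gap in the graph.
    """
    in_window = [
        value for timestamp, value in samples
        if eval_time - window < timestamp <= eval_time
    ]
    return max(in_window) if in_window else None

# Invented queue lengths, one sample every 30 seconds: the queue blocks
# briefly (4, then 9 entries) and drains again.
samples = [(0, 0.0), (30, 0.0), (60, 4.0), (90, 9.0), (120, 3.0), (150, 0.0)]

FIVE_MINUTES = 300
print(max_over_time(samples, 150, FIVE_MINUTES))  # spike still inside the window
print(max_over_time(samples, 400, FIVE_MINUTES))  # spike has left the window
print(max_over_time(samples, 450, FIVE_MINUTES))  # no samples in the window
```

The first call still reports the peak even though the queue is already empty at that moment, which is why a short blockage stays visible on the graph for the length of the window.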

 

You can find more details on how to create PromQL queries at https://prometheus.io/docs/prometheus/latest/querying/basics/.

 

I cannot share our dashboard code; company policy doesn't allow it.

 

Good luck building your perfect AEM Grafana dashboard!