SOLVED

Monitoring AEM with JMX Exporter and Prometheus + Grafana (AEM Dashboard?)

Level 4

Is anyone monitoring AEM 6.5 with JMX Exporter and Prometheus + Grafana? We have a rudimentary configuration which only seems to capture JVM metrics about memory use, threads, class loading, etc. which is great - but I would like to gather more details related to specific AEM functionality like replication, etc.

 

I saw the following solution from @wimsymons here:

- https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/aem-prometheus-jmx-exporte...

but haven't tried doing something similar in my jmx-config.yaml file due to the age of the post.

 

I'm using the following JVM options in my crx-quickstart/bin/start file:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.rmi.port=9010
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

-javaagent:/opt/prometheus/jmx_prometheus_javaagent-0.17.2.jar=9404:/opt/prometheus/jmx-config.yaml

My jmx-config.yaml file looks like this:

startDelaySeconds: 0
ssl: false
rules:
  - pattern: ".*"

which I think means "get everything": nothing is whitelisted or blacklisted.

 

I'm using the following dashboard in Grafana:

https://grafana.com/grafana/dashboards/14845-jmx-dashboard-basic/

 

If I curl the exporter:

curl 127.0.0.1:9404

I do see a ton of Adobe and Apache related info. I expect Prometheus is storing this data. But the dashboard I'm using doesn't display anything beyond basic info re: memory, threads, class loading, etc.

 

Maybe I need a better AEM specific dashboard?

 

Thanks for any info and help!

4 Replies

Level 4

Our jmx-config.yaml and jmx-exporter version (0.14) are still the same.

whitelistObjectNames: [
  "com.adobe.granite.replication:type=agent,*",
  "com.adobe.granite.requests.logging:type=Metrics,name=granite.request.metrics.timer",
  "com.adobe.granite:type=Repository",
  "org.apache.jackrabbit.oak:type=IndexStats,*",
  "org.apache.jackrabbit.oak:type=Metrics,name=SESSION_COUNT",
  "org.apache.jackrabbit.oak:type=SegmentRevisionGarbageCollection,*",
  "org.apache.jackrabbit.oak:type=\"Standby\",*",
  "org.apache.sling.healthcheck:type=HealthCheck,name=MaintenanceTaskRevisionCleanupTask",
  "org.apache.sling.installer:type=Installer,name=Sling OSGi Installer",
  "org.apache.sling:type=queues,*",
]

This still works fine on AEM 6.5.15.0.

 

This provides us Grafana dashboards with Active and Queued Sling Jobs, JCR Session Count, Replication Queue Length, Cold Standby Lag, Async Indexing, Last GC duration, JVM Heap Space Usage, Segment Store Size.

 

But you do need to develop your Prometheus queries based on those metrics. That's not so hard.

 

I want to give you a final pointer: don't export all JMX Beans. It will slow down your AEM instance and it can create memory leaks as Adobe's JMX Beans are not built to be queried every x seconds.
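
A whitelist also filters out JMX-based system metrics that are not listed, for example physical memory from the operating system MBean. One option is to add the relevant JVM MBeans explicitly. This is only a minimal sketch: the three java.lang entries are an assumption (standard JVM platform MBean names) and are not part of the config above, and the AEM entries are a subset of that list. Adjust both to what your dashboard actually queries.

```yaml
# jmx-config.yaml (sketch): AEM whitelist plus a few standard JVM MBeans
startDelaySeconds: 0
ssl: false
whitelistObjectNames:
  # Assumption: standard JVM platform MBeans, added to keep system metrics
  - "java.lang:type=Memory"
  - "java.lang:type=OperatingSystem"
  - "java.lang:type=Threading"
  # Subset of the AEM entries from the list above
  - "com.adobe.granite.replication:type=agent,*"
  - "org.apache.jackrabbit.oak:type=Metrics,name=SESSION_COUNT"
  - "org.apache.sling:type=queues,*"
rules:
  # Export every attribute of the whitelisted MBeans in the default format
  - pattern: ".*"
```

Keeping the list short follows the same reasoning as the pointer above: only the MBeans named here are queried on each scrape.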

Level 4

Thanks for the info. I'll give your config a go.

Do you know what controls the polling interval and how would I adjust that?

Maybe it's the scrape_interval in my grafana-agent.yaml config file (below)?

 

server:
  log_level: warn

metrics:
  global:
    scrape_interval: 30s
    remote_write:
      - url: 'https://xxx/api/v1/remote_write'
        sigv4:
          region: 'xxx'
        queue_config:
          max_samples_per_send: 1000
          max_shards: 200
          capacity: 2500
  wal_directory: '/var/lib/grafana-agent'
  configs:
    - name: AEM_JMX
      scrape_configs:
        - job_name: AEM_JMX
          static_configs:
            - targets: ['xxx:9404']
integrations:
  agent:
    enabled: true
  node_exporter:
    enabled: true
    include_exporter_metrics: true
    enable_collectors:
      # - "ntp"
      - "systemd"
    disable_collectors:
      - "mdadm"

 

Other than the whitelist, could you post the rest of your jmx-config.yaml file? I'm not sure what else might be good or bad to have there.

When you say:

 

But you do need to develop your Prometheus queries based on those metrics. That's not so hard.


Is this done via a dashboard file? I'm using one downloaded from here:

https://grafana.com/grafana/dashboards/14845-jmx-dashboard-basic/

If you could post an example (or some info) of how to display/query an AEM-specific metric, I would be grateful.

I'm also seeing the following in our error log (with our current ".*" export):

 

04.04.2023 14:53:57.436 *WARN* [prometheus-http-1-1] com.day.crx.sling.server.impl.jmx.SecureContentRepositoryAccess Denied reference from bundle 'org.apache.aries.jmx.core'.
04.04.2023 14:53:58.001 *INFO* [HealthCheck Synchronized Clocks] org.apache.sling.discovery.oak.SynchronizedClocksHealthCheck execute: no topology connectors connected to local instance.
04.04.2023 14:53:58.164 *WARN* [prometheus-http-1-1] org.apache.jackrabbit.vault.packaging.impl.PackageManagerMBeanImpl Unable to provide package list. Repository not bound.

 

I tried the following jmx-config.yaml configs:

 

startDelaySeconds: 0
ssl: false
whitelistObjectNames: [
  "com.adobe.granite.replication:type=agent,*",
  "com.adobe.granite.requests.logging:type=Metrics,name=granite.request.metrics.timer",
  "com.adobe.granite:type=Repository",
  "org.apache.jackrabbit.oak:type=IndexStats,*",
  "org.apache.jackrabbit.oak:type=Metrics,name=SESSION_COUNT",
  "org.apache.jackrabbit.oak:type=SegmentRevisionGarbageCollection,*",
  "org.apache.jackrabbit.oak:type=\"Standby\",*",
  "org.apache.sling.healthcheck:type=HealthCheck,name=MaintenanceTaskRevisionCleanupTask",
  "org.apache.sling.installer:type=Installer,name=Sling OSGi Installer",
  "org.apache.sling:type=queues,*",
]

and

startDelaySeconds: 0
ssl: false
whitelistObjectNames: [
  "com.adobe.granite.replication:type=agent,*",
  "com.adobe.granite.requests.logging:type=Metrics,name=granite.request.metrics.timer",
  "com.adobe.granite:type=Repository",
  "org.apache.jackrabbit.oak:type=IndexStats,*",
  "org.apache.jackrabbit.oak:type=Metrics,name=SESSION_COUNT",
  "org.apache.jackrabbit.oak:type=SegmentRevisionGarbageCollection,*",
  "org.apache.jackrabbit.oak:type=\"Standby\",*",
  "org.apache.sling.healthcheck:type=HealthCheck,name=MaintenanceTaskRevisionCleanupTask",
  "org.apache.sling.installer:type=Installer,name=Sling OSGi Installer",
  "org.apache.sling:type=queues,*",
]
rules:
  - pattern: ".*"

 

With both of these, I was actually missing data (system metrics such as physical memory) that had been there with the original jmx-config.yaml:

 

startDelaySeconds: 0
ssl: false
rules:
  - pattern: ".*"

 

Also, with your whitelisted config, AEM-related metrics didn't show up in my dashboard, but maybe I need to do some specific queries.

Thanks again!

Correct answer by
Level 4

What I posted was the entire jmx-config.yaml file. There is no more to it.

 

Scraping interval is indeed controlled by Prometheus. I'm not an expert in that area, as our operations team has set up Prometheus company-wide. Please read the Prometheus documentation on how to control that.
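
As a rough illustration only: in a Prometheus-style scrape configuration, a per-job scrape_interval overrides the global value. The sketch below assumes the Grafana Agent layout from the grafana-agent.yaml posted earlier in this thread. The 60s value is an arbitrary example, and the setting should be checked against the documentation for the agent version in use.

```yaml
# grafana-agent.yaml (excerpt, sketch): slow down scraping for the AEM job only
metrics:
  global:
    scrape_interval: 30s        # default for all jobs
  configs:
    - name: AEM_JMX
      scrape_configs:
        - job_name: AEM_JMX
          scrape_interval: 60s  # assumption: example value; overrides the global 30s for this job
          static_configs:
            - targets: ['xxx:9404']
```

A longer interval means the JMX beans are queried less often, at the cost of coarser graphs.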

 

And you do need to create new panels with PromQL queries in your Grafana dashboard to visualize the extra AEM metrics the JMX exporter provides.

 

A snippet of the data provided by the JMX exporter:

# HELP com_adobe_granite_replication_agent_QueueNumEntries Returns the number of entries in the replication queue. (com.adobe.granite.replication<type=agent, id="static"><>QueueNumEntries)
# TYPE com_adobe_granite_replication_agent_QueueNumEntries untyped
com_adobe_granite_replication_agent_QueueNumEntries{id="\"static\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"youtube\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"flush\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"dynamic_media_replication\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"screens\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"test_and_target\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"scene7\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"publish_reverse\"",} 0.0
com_adobe_granite_replication_agent_QueueNumEntries{id="\"s7delivery\"",} 0.0

 

The next step is to set up your Grafana dashboard; see https://grafana.com/docs/grafana/latest/getting-started/build-first-dashboard/ for how to do that. You can also find several videos on YouTube on the same topic (search for "grafana tutorial").

 

An example query (related to the same snippet of data, in this case the number of items on the replication queue) is:

max_over_time(com_adobe_granite_replication_agent_QueueNumEntries[5m])

This gives you the maximum number of items on each replication queue over the preceding 5 minutes. If your replication queues are blocked, you will see the graph go up at that point in time. This fits best in a time series visualization.
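
The same metric could also drive an alert. Below is a hedged sketch of a Prometheus alerting rule file, assuming the rules are evaluated by Prometheus itself (a remote-write or managed setup may load rules differently). The group name, alert name, 15m window, and severity label are all hypothetical choices. It uses min_over_time rather than max_over_time so that it only matches when a queue never emptied during the whole window.

```yaml
# aem-alerts.yaml (sketch): hypothetical rule file name
groups:
  - name: aem-replication                      # hypothetical group name
    rules:
      - alert: AemReplicationQueueNotDraining  # hypothetical alert name
        # True when a queue held at least one entry for the entire 15m window
        expr: min_over_time(com_adobe_granite_replication_agent_QueueNumEntries[15m]) > 0
        # The condition must stay true for another 5 minutes before the alert fires
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication queue {{ $labels.id }} has not drained for at least 15 minutes"
```

A busy but healthy queue drops back to 0 between bursts and so does not match, while a blocked queue keeps a non-zero minimum.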

 

You can find more details on how to create PromQL queries at https://prometheus.io/docs/prometheus/latest/querying/basics/.

 

I cannot share our dashboard code; company policy doesn't allow it.

 

Good luck building your perfect AEM Grafana dashboard!

Level 4

Thank you, this is very helpful! I tried your example and it seems to work. Now to set up an alarm if the queue seems blocked ... do you do this?

I'll try to come up with queries for:


@wimsymons wrote:

Active and Queued Sling Jobs, JCR Session Count, Replication Queue Length, Cold Standby Lag, Async Indexing, Last GC duration, JVM Heap Space Usage, Segment Store Size.


Can you recommend the metric names for these? I saved the output of

curl localhost:9404 > jmx.out

and filtered on: Metrics, adobe, apache, etc. but there is a lot. I'm wondering which are the most useful?

I suspect that mastering PromQL and the finer points of Grafana dashboard creation could take quite some time.

It would be great if anyone who has, or knows of, existing AEM-related Grafana dashboards could publish the dashboard .json somewhere, such as here, on GitHub, or on the Grafana dashboards page (https://grafana.com/grafana/dashboards/).

If I come up with any that seem useful, I'll try to do the same.