Replication agent queue is blocked, frequently | Adobe Higher Education
Skip to main content
Level 2
November 2, 2023
解決済み

Replication agent queue is blocked, frequently

  • November 2, 2023
  • 2 の返信
  • 1925 ビュー

In my local env, I have 4 publisher. Sometime replication agent gets blocked and I will take time to find the cause, so is there any way to monitor the replication queue and get the alerts if its blocked.

このトピックへの返信は締め切られました。
ベストアンサー MarkusBullaAdobe

Hi @akash1247!

It is a good idea to monitor the status of your replication queues.

There are different approaches to this. One that I have seen with some of my customers is the following:

  • Write a script to query the status of your replication agents. This could be e. g. a CLI script leveraging curl/wget to retrieve the replication agent status page (/etc/replication/agents.author.html) or the status of each replication agent individually. Parse the output and return the status information in a format of your preference.
  • Integrate this script with your monitoring suite to retrieve alerts for blocked queues.

 

A simple BASH script could look like the following:

 

curl ${CURLPARAMETERS} -o tmp-replication-check.txt "http://${CQ_HOSTPORT}/etc/replication/agents.${CQ_TYPE}.html" # get activated agents REPL_AGENTS=`cat tmp-replication-check.txt | grep "cq-agent-header-on" | cut -d"\"" -f4` REPL_AGENT_NUM=`cat tmp-replication-check.txt | grep "cq-agent-header-on" | cut -d"\"" -f4 | wc -l` REPL_AGENT_PROBLEMS=0 AGENT_COUNT=0 echo "Found ${REPL_AGENT_NUM} activated replicatication agents, now checking each of them..." # check all activated agents for AGENT in ${REPL_AGENTS} do AGENT=${AGENT%.html} SHORT_AGENT=`echo ${AGENT} | cut -d"/" -f5` AGENT_COUNT=$((AGENT_COUNT + 1)) # check agent status REPL_AGENT_STATUS=`cat tmp-replication-check.txt | grep -a2 "${AGENT}" | egrep "cq-agent-status-|cq-agent-queue-" | cut -d"\"" -f2 | cut -d" " -f2` REPL_AGENT_UNEXP_STATUS=`echo "${REPL_AGENT_STATUS}" | grep -v "cq-agent-status-ok" | grep -v "cq-agent-queue-idle" | grep -v "cq-agent-queue-active" | wc -l` if [ "${REPL_AGENT_UNEXP_STATUS}" -gt 0 ]; then ERROR_OUTPUT=`echo "${REPL_AGENT_STATUS}" | tr "\\n" " "` echo " Agent ${AGENT} has unexpected status: ${ERROR_OUTPUT} - please check!" REPL_AGENT_PROBLEMS=$((REPL_AGENT_PROBLEMS + 1)) else echo " Status of agent ${AGENT} is OK." fi # check logfile of agent curl ${CURLPARAMETERS} -o tmp-replication-log-${AGENT_COUNT}.txt "http://${CQ_HOSTPORT}${AGENT}.log.html" LOGS_NOTOK=false # check logs for ERROR messages NUM_LINES_TO_CHECK=25 NUM_ERRORS_IN_LOGS=`tail -${NUM_LINES_TO_CHECK} tmp-replication-log-${AGENT_COUNT}.txt | grep "ERROR" | wc -l` if [ "${NUM_ERRORS_IN_LOGS}" -gt 0 ]; then ERROR_OUTPUT_LOG=`tail -${NUM_LINES_TO_CHECK} ptmp-replication-log-${AGENT_COUNT}.txt | grep "ERROR"` echo " Agent ${AGENT} shows ERRORs in logs - please check!" echo ${ERROR_OUTPUT_LOG} REPL_AGENT_PROBLEMS=$((REPL_AGENT_PROBLEMS + 1)) LOGS_NOTOK=true else LOGS_NOTOK=false echo " LOGs of agent ${AGENT} are OK." fi # cleanup rm -f tmp-replication-log-${AGENT_COUNT}.txt done if [ "${REPL_AGENT_PROBLEMS}" -gt 0 ]; then echo "Found ${REPL_AGENT_PROBLEMS} problem(s) with agents! Please check!" RETURN_CODE=2 else echo "All activated agents are ok." RETURN_CODE=0 fi # cleanup rm -f tmp-replication-check.txt

 

 

 Disclaimer: This script is an extract from a pretty old scripting framework I wrote multiple years ago and has neither been individually tested nor is it ready-to-run. It is just meant for illustration purposes. There is lots of room for improvement, error checks as well as sanitizations are missing and there are probably other/better ways to achieve this. It's also very debatable if using bash is the right approach for this kind of task in the first place. It might be better to use an actual scripting language or develop a proper health check / endpoint in Java as part of your application for monitoring purposes that can directly be consumed by a monitoring system.

 

In addition to the monitoring part you should also investigate to find the root cause of frequently blocked replication queues as this is not a common/normal behavior.

 

Hope this helps!

2 の返信

ManviSharma
Adobe Employee
Adobe Employee
November 2, 2023

Hi,

 

To monitor and get alerts for blocked replication queues in AEM:

  1. Access the Adobe Granite Replication dashboard at http://localhost:4503/libs/granite/replication/content/status.html.

  2. Monitor replication agents' status at http://localhost:4502/etc/replication/agents.author.html.

  3. Set up alerts using AEM's "Notifications" feature for specific replication issues.

  4. Analyze replication agent logs for debugging.

  5. Consider using third-party monitoring tools for more extensive monitoring.

Akash1247作成者
Level 2
November 2, 2023

@manvisharma  can we write any script to get status of each publishers like if its idle state or blocked.
or may be using curl
I can get status of publishers using /libs/granite/operations/content/healthreports/healthreport.html/system/sling/monitoring/mbeans/org/apache/sling/healthcheck/HealthCheck/replicationQueue
but I again its manual thing.

MarkusBullaAdobe
Adobe Employee
Adobe Employee
November 3, 2023

Hi @akash1247!

It is a good idea to monitor the status of your replication queues.

There are different approaches to this. One that I have seen with some of my customers is the following:

  • Write a script to query the status of your replication agents. This could be e. g. a CLI script leveraging curl/wget to retrieve the replication agent status page (/etc/replication/agents.author.html) or the status of each replication agent individually. Parse the output and return the status information in a format of your preference.
  • Integrate this script with your monitoring suite to retrieve alerts for blocked queues.

 

A simple BASH script could look like the following:

 

curl ${CURLPARAMETERS} -o tmp-replication-check.txt "http://${CQ_HOSTPORT}/etc/replication/agents.${CQ_TYPE}.html" # get activated agents REPL_AGENTS=`cat tmp-replication-check.txt | grep "cq-agent-header-on" | cut -d"\"" -f4` REPL_AGENT_NUM=`cat tmp-replication-check.txt | grep "cq-agent-header-on" | cut -d"\"" -f4 | wc -l` REPL_AGENT_PROBLEMS=0 AGENT_COUNT=0 echo "Found ${REPL_AGENT_NUM} activated replicatication agents, now checking each of them..." # check all activated agents for AGENT in ${REPL_AGENTS} do AGENT=${AGENT%.html} SHORT_AGENT=`echo ${AGENT} | cut -d"/" -f5` AGENT_COUNT=$((AGENT_COUNT + 1)) # check agent status REPL_AGENT_STATUS=`cat tmp-replication-check.txt | grep -a2 "${AGENT}" | egrep "cq-agent-status-|cq-agent-queue-" | cut -d"\"" -f2 | cut -d" " -f2` REPL_AGENT_UNEXP_STATUS=`echo "${REPL_AGENT_STATUS}" | grep -v "cq-agent-status-ok" | grep -v "cq-agent-queue-idle" | grep -v "cq-agent-queue-active" | wc -l` if [ "${REPL_AGENT_UNEXP_STATUS}" -gt 0 ]; then ERROR_OUTPUT=`echo "${REPL_AGENT_STATUS}" | tr "\\n" " "` echo " Agent ${AGENT} has unexpected status: ${ERROR_OUTPUT} - please check!" REPL_AGENT_PROBLEMS=$((REPL_AGENT_PROBLEMS + 1)) else echo " Status of agent ${AGENT} is OK." fi # check logfile of agent curl ${CURLPARAMETERS} -o tmp-replication-log-${AGENT_COUNT}.txt "http://${CQ_HOSTPORT}${AGENT}.log.html" LOGS_NOTOK=false # check logs for ERROR messages NUM_LINES_TO_CHECK=25 NUM_ERRORS_IN_LOGS=`tail -${NUM_LINES_TO_CHECK} tmp-replication-log-${AGENT_COUNT}.txt | grep "ERROR" | wc -l` if [ "${NUM_ERRORS_IN_LOGS}" -gt 0 ]; then ERROR_OUTPUT_LOG=`tail -${NUM_LINES_TO_CHECK} ptmp-replication-log-${AGENT_COUNT}.txt | grep "ERROR"` echo " Agent ${AGENT} shows ERRORs in logs - please check!" echo ${ERROR_OUTPUT_LOG} REPL_AGENT_PROBLEMS=$((REPL_AGENT_PROBLEMS + 1)) LOGS_NOTOK=true else LOGS_NOTOK=false echo " LOGs of agent ${AGENT} are OK." fi # cleanup rm -f tmp-replication-log-${AGENT_COUNT}.txt done if [ "${REPL_AGENT_PROBLEMS}" -gt 0 ]; then echo "Found ${REPL_AGENT_PROBLEMS} problem(s) with agents! Please check!" RETURN_CODE=2 else echo "All activated agents are ok." RETURN_CODE=0 fi # cleanup rm -f tmp-replication-check.txt

 

 

 Disclaimer: This script is an extract from a pretty old scripting framework I wrote multiple years ago and has neither been individually tested nor is it ready-to-run. It is just meant for illustration purposes. There is lots of room for improvement, error checks as well as sanitizations are missing and there are probably other/better ways to achieve this. It's also very debatable if using bash is the right approach for this kind of task in the first place. It might be better to use an actual scripting language or develop a proper health check / endpoint in Java as part of your application for monitoring purposes that can directly be consumed by a monitoring system.

 

In addition to the monitoring part you should also investigate to find the root cause of frequently blocked replication queues as this is not a common/normal behavior.

 

Hope this helps!