Expand my Community achievements bar.

Join us in celebrating the outstanding achievement of our AEM Community Member of the Year!
SOLVED

Replication agent queue is blocked, frequently

Avatar

Level 2

In my local env, I have 4 publisher. Sometime replication agent gets blocked and I will take time to find the cause, so is there any way to monitor the replication queue and get the alerts if its blocked.

Topics

Topics help categorize Community content and increase your ability to discover relevant content.

1 Accepted Solution

Avatar

Correct answer by
Employee Advisor

Hi @Akash1247!

It is a good idea to monitor the status of your replication queues.

There are different approaches to this. One that I have seen with some of my customers is the following:

  • Write a script to query the status of your replication agents. This could be e. g. a CLI script leveraging curl/wget to retrieve the replication agent status page (/etc/replication/agents.author.html) or the status of each replication agent individually. Parse the output and return the status information in a format of your preference.
  • Integrate this script with your monitoring suite to retrieve alerts for blocked queues.

 

A simple BASH script could look like the following:

 

curl ${CURLPARAMETERS} -o tmp-replication-check.txt "http://${CQ_HOSTPORT}/etc/replication/agents.${CQ_TYPE}.html"

# get activated agents
REPL_AGENTS=`cat tmp-replication-check.txt | grep "cq-agent-header-on" | cut -d"\"" -f4`
REPL_AGENT_NUM=`cat tmp-replication-check.txt | grep "cq-agent-header-on" | cut -d"\"" -f4 | wc -l`
REPL_AGENT_PROBLEMS=0
AGENT_COUNT=0
echo "Found ${REPL_AGENT_NUM} activated replicatication agents, now checking each of them..."
# check all activated agents
for AGENT in ${REPL_AGENTS}
do
        AGENT=${AGENT%.html}
        SHORT_AGENT=`echo ${AGENT} | cut -d"/" -f5`
        AGENT_COUNT=$((AGENT_COUNT + 1))
        # check agent status
        REPL_AGENT_STATUS=`cat tmp-replication-check.txt | grep -a2 "${AGENT}" | egrep "cq-agent-status-|cq-agent-queue-" | cut -d"\"" -f2 | cut -d" " -f2`
        REPL_AGENT_UNEXP_STATUS=`echo "${REPL_AGENT_STATUS}" | grep -v "cq-agent-status-ok" | grep -v "cq-agent-queue-idle" | grep -v "cq-agent-queue-active" | wc -l`
        if [ "${REPL_AGENT_UNEXP_STATUS}" -gt 0 ]; then
                ERROR_OUTPUT=`echo "${REPL_AGENT_STATUS}" | tr "\\n" " "`
                echo "    Agent ${AGENT} has unexpected status: ${ERROR_OUTPUT} - please check!"
                REPL_AGENT_PROBLEMS=$((REPL_AGENT_PROBLEMS + 1))
        else
                echo "    Status of agent ${AGENT} is  OK."
        fi

        # check logfile of agent
	curl ${CURLPARAMETERS} -o tmp-replication-log-${AGENT_COUNT}.txt "http://${CQ_HOSTPORT}${AGENT}.log.html" 

        LOGS_NOTOK=false

        # check logs for ERROR messages
        NUM_LINES_TO_CHECK=25
        NUM_ERRORS_IN_LOGS=`tail -${NUM_LINES_TO_CHECK} tmp-replication-log-${AGENT_COUNT}.txt | grep "ERROR" | wc -l`
        if [ "${NUM_ERRORS_IN_LOGS}" -gt 0 ]; then
                ERROR_OUTPUT_LOG=`tail -${NUM_LINES_TO_CHECK} ptmp-replication-log-${AGENT_COUNT}.txt | grep "ERROR"`
                echo "    Agent ${AGENT} shows ERRORs in logs - please check!"
                echo ${ERROR_OUTPUT_LOG}
                REPL_AGENT_PROBLEMS=$((REPL_AGENT_PROBLEMS + 1))
                LOGS_NOTOK=true
        else
                LOGS_NOTOK=false
                echo "    LOGs of agent   ${AGENT} are OK."
        fi
        # cleanup
        rm -f tmp-replication-log-${AGENT_COUNT}.txt
done

if [ "${REPL_AGENT_PROBLEMS}" -gt 0 ]; then
        echo "Found ${REPL_AGENT_PROBLEMS} problem(s) with agents! Please check!"
        RETURN_CODE=2
else
        echo "All activated agents are ok."
        RETURN_CODE=0
fi

# cleanup
rm -f tmp-replication-check.txt

 

 

 Disclaimer: This script is an extract from a pretty old scripting framework I wrote multiple years ago and has neither been individually tested nor is it ready-to-run. It is just meant for illustration purposes. There is lots of room for improvement, error checks as well as sanitizations are missing and there are probably other/better ways to achieve this. It's also very debatable if using bash is the right approach for this kind of task in the first place. It might be better to use an actual scripting language or develop a proper health check / endpoint in Java as part of your application for monitoring purposes that can directly be consumed by a monitoring system.

 

In addition to the monitoring part you should also investigate to find the root cause of frequently blocked replication queues as this is not a common/normal behavior.

 

Hope this helps!

View solution in original post

3 Replies

Avatar

Employee Advisor

Hi,

 

To monitor and get alerts for blocked replication queues in AEM:

  1. Access the Adobe Granite Replication dashboard at http://localhost:4503/libs/granite/replication/content/status.html.

  2. Monitor replication agents' status at http://localhost:4502/etc/replication/agents.author.html.

  3. Set up alerts using AEM's "Notifications" feature for specific replication issues.

  4. Analyze replication agent logs for debugging.

  5. Consider using third-party monitoring tools for more extensive monitoring.

Avatar

Level 2

@ManviSharma  can we write any script to get status of each publishers like if its idle state or blocked.
or may be using curl
I can get status of publishers using /libs/granite/operations/content/healthreports/healthreport.html/system/sling/monitoring/mbeans/org/apache/sling/healthcheck/HealthCheck/replicationQueue
but I again its manual thing.

Avatar

Correct answer by
Employee Advisor

Hi @Akash1247!

It is a good idea to monitor the status of your replication queues.

There are different approaches to this. One that I have seen with some of my customers is the following:

  • Write a script to query the status of your replication agents. This could be e. g. a CLI script leveraging curl/wget to retrieve the replication agent status page (/etc/replication/agents.author.html) or the status of each replication agent individually. Parse the output and return the status information in a format of your preference.
  • Integrate this script with your monitoring suite to retrieve alerts for blocked queues.

 

A simple BASH script could look like the following:

 

curl ${CURLPARAMETERS} -o tmp-replication-check.txt "http://${CQ_HOSTPORT}/etc/replication/agents.${CQ_TYPE}.html"

# get activated agents
REPL_AGENTS=`cat tmp-replication-check.txt | grep "cq-agent-header-on" | cut -d"\"" -f4`
REPL_AGENT_NUM=`cat tmp-replication-check.txt | grep "cq-agent-header-on" | cut -d"\"" -f4 | wc -l`
REPL_AGENT_PROBLEMS=0
AGENT_COUNT=0
echo "Found ${REPL_AGENT_NUM} activated replicatication agents, now checking each of them..."
# check all activated agents
for AGENT in ${REPL_AGENTS}
do
        AGENT=${AGENT%.html}
        SHORT_AGENT=`echo ${AGENT} | cut -d"/" -f5`
        AGENT_COUNT=$((AGENT_COUNT + 1))
        # check agent status
        REPL_AGENT_STATUS=`cat tmp-replication-check.txt | grep -a2 "${AGENT}" | egrep "cq-agent-status-|cq-agent-queue-" | cut -d"\"" -f2 | cut -d" " -f2`
        REPL_AGENT_UNEXP_STATUS=`echo "${REPL_AGENT_STATUS}" | grep -v "cq-agent-status-ok" | grep -v "cq-agent-queue-idle" | grep -v "cq-agent-queue-active" | wc -l`
        if [ "${REPL_AGENT_UNEXP_STATUS}" -gt 0 ]; then
                ERROR_OUTPUT=`echo "${REPL_AGENT_STATUS}" | tr "\\n" " "`
                echo "    Agent ${AGENT} has unexpected status: ${ERROR_OUTPUT} - please check!"
                REPL_AGENT_PROBLEMS=$((REPL_AGENT_PROBLEMS + 1))
        else
                echo "    Status of agent ${AGENT} is  OK."
        fi

        # check logfile of agent
	curl ${CURLPARAMETERS} -o tmp-replication-log-${AGENT_COUNT}.txt "http://${CQ_HOSTPORT}${AGENT}.log.html" 

        LOGS_NOTOK=false

        # check logs for ERROR messages
        NUM_LINES_TO_CHECK=25
        NUM_ERRORS_IN_LOGS=`tail -${NUM_LINES_TO_CHECK} tmp-replication-log-${AGENT_COUNT}.txt | grep "ERROR" | wc -l`
        if [ "${NUM_ERRORS_IN_LOGS}" -gt 0 ]; then
                ERROR_OUTPUT_LOG=`tail -${NUM_LINES_TO_CHECK} ptmp-replication-log-${AGENT_COUNT}.txt | grep "ERROR"`
                echo "    Agent ${AGENT} shows ERRORs in logs - please check!"
                echo ${ERROR_OUTPUT_LOG}
                REPL_AGENT_PROBLEMS=$((REPL_AGENT_PROBLEMS + 1))
                LOGS_NOTOK=true
        else
                LOGS_NOTOK=false
                echo "    LOGs of agent   ${AGENT} are OK."
        fi
        # cleanup
        rm -f tmp-replication-log-${AGENT_COUNT}.txt
done

if [ "${REPL_AGENT_PROBLEMS}" -gt 0 ]; then
        echo "Found ${REPL_AGENT_PROBLEMS} problem(s) with agents! Please check!"
        RETURN_CODE=2
else
        echo "All activated agents are ok."
        RETURN_CODE=0
fi

# cleanup
rm -f tmp-replication-check.txt

 

 

 Disclaimer: This script is an extract from a pretty old scripting framework I wrote multiple years ago and has neither been individually tested nor is it ready-to-run. It is just meant for illustration purposes. There is lots of room for improvement, error checks as well as sanitizations are missing and there are probably other/better ways to achieve this. It's also very debatable if using bash is the right approach for this kind of task in the first place. It might be better to use an actual scripting language or develop a proper health check / endpoint in Java as part of your application for monitoring purposes that can directly be consumed by a monitoring system.

 

In addition to the monitoring part you should also investigate to find the root cause of frequently blocked replication queues as this is not a common/normal behavior.

 

Hope this helps!