Replication agents broken




We are running AEM 6.3. A few days ago the server admin team restarted one of our publishers during the day, which broke the replication agent for that server on the author. The queue was blocked and would not update even after the publisher came back up. The "next retry in.." message was counting down into negative numbers. I was still able to push the content through manually by clicking force retry (which would publish just the first item in the list) and then clearing that item, repeating this until I got through the entire queue. After this, with nothing left in the queue, it was still blocked. The only way I could fix it was to rename the agent so that AEM treated it as a new one.

Whenever I opened the agent's queue page, I saw this message in the log (the red text changed every time):

POST /etc/replication/agents.author/twsmwinfzap02---2/jcr:content.queue.json HTTP/1.1] com.day.cq.replication.impl.ReplicationContentFactoryProviderImpl Replication content node does not exist /var/replication/data/5feebbac-3af6-436d-b8ba-f89a7a5bf40b/69/6915ed86-b6da-496f-ae8e-6e64398f834e.
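In case it helps with diagnosis, here is a minimal sketch of how lines like this could be pulled out of the log to see which agents and which content nodes are affected. The regular expression is only matched against the message as pasted above; the exact log layout and the function name `find_missing_nodes` are assumptions for illustration, not anything provided by AEM.

```python
import re

# Matches the message quoted above. The surrounding log layout is an assumption,
# so only the request path and the error text are anchored.
MISSING_NODE = re.compile(
    r"POST /etc/replication/agents\.author/(?P<agent>[^/]+)/jcr:content\.queue\.json"
    r".*?Replication content node does not exist "
    r"(?P<path>/var/replication/data/\S+?)\.?$"
)


def find_missing_nodes(lines):
    """Return (agent name, missing node path) pairs for every matching log line."""
    found = []
    for line in lines:
        match = MISSING_NODE.search(line.strip())
        if match:
            found.append((match.group("agent"), match.group("path")))
    return found


if __name__ == "__main__":
    sample = (
        "POST /etc/replication/agents.author/twsmwinfzap02---2/jcr:content.queue.json HTTP/1.1] "
        "com.day.cq.replication.impl.ReplicationContentFactoryProviderImpl "
        "Replication content node does not exist "
        "/var/replication/data/5feebbac-3af6-436d-b8ba-f89a7a5bf40b/69/"
        "6915ed86-b6da-496f-ae8e-6e64398f834e."
    )
    for agent, path in find_missing_nodes([sample]):
        print(agent, path)
```

Running it over the lines of the author's error log would give one pair per occurrence, which makes it easier to check whether the same content node keeps being requested or a different one each time.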

Since that day it has happened two more times, on different agents, when a publish was disrupted (e.g. a large PDF was transferring and hit a connection issue).

Any ideas how we can fix this?