Hello. I currently have a replication listener (which generates a PDF when a site is published) and a scheduler, and they randomly stop working until I reactivate them. I want to create a way to monitor them and alert my team when they stop working, because they are a critical part of our automated process. I read about Health Checks, which seem like a good solution for the scheduler, but I have not found anything for the replication listener.
Did you/your team get a chance to debug its root cause? Does it throw any error in logs?
It seems that you plan to send an automated alert so that someone could manually restart the bundle as a workaround each time it happens.
What you can do is use a JCR property as configuration to activate or deactivate your replication listener.
In short, you can use a page properties dialog with a checkbox that enables and disables the replication listener, and after the property is saved you can send a notification if the configuration is disabled. In your replication listener you need to add an if that lets your code proceed only when the property is enabled.
Let me know if this could be a solution or if you need more info.
I don't think that this is a good solution. First of all, you are just working around the problem. And secondly, you allow somebody (an author) to restart services, which can have a lot of consequences (depending on the implementation, but it can lead to a system that is briefly unavailable when the setting is changed).
I think that we can move this configuration into a system console configuration; that way only an administrator could enable or disable it. By the way, it seems that nobody has provided a solution for this point.
It seems that everyone suggests using a healthcheck for the scheduler (which is really simple), but no solution has been provided for the replication service.
If you want to observe a scheduled job, a healthcheck can be helpful. You need to implement a healthcheck which checks that the last run of this service was within the last N minutes/hours. Your service has to record the current time before it finishes the processing, and the HC just compares this timestamp with the current time.
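To illustrate the timestamp comparison, here is a minimal sketch in plain Java. The class and method names (LastRunTracker, recordRun, isHealthy) are made up for this example; in AEM you would call something like isHealthy from your actual health check implementation and have the scheduled job call recordRun just before it finishes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper: remembers when the scheduled job last completed. */
class LastRunTracker {
    private final AtomicReference<Instant> lastRun = new AtomicReference<>();

    /** Call this from the job just before it finishes processing. */
    void recordRun(Instant now) {
        lastRun.set(now);
    }

    /** True only if a run was recorded and it is no older than maxAge. */
    boolean isHealthy(Instant now, Duration maxAge) {
        Instant last = lastRun.get();
        if (last == null) {
            return false; // no run recorded since this object was created
        }
        return Duration.between(last, now).compareTo(maxAge) <= 0;
    }
}
```

Note that the timestamp here lives in memory only, so the check reports unhealthy after a restart until the job has run once. Whether that is acceptable depends on how often the job runs.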
Regarding the root cause analysis: I have never had this problem, so it's definitely an issue somewhere. I doubt that the scheduler itself broke down, so I rather assume that your service threw an exception and is left with an inconsistent in-memory state. Check that you handle all exceptions and at least log them.
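As a rough sketch of "handle all exceptions and log them", the job body can be wrapped so that a failure is logged and remembered instead of escaping silently. GuardedJob is a hypothetical name, and java.util.logging is used only to keep the example self-contained; in AEM you would normally log through SLF4J.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper: runs a job, logs any failure, never rethrows. */
class GuardedJob implements Runnable {
    private static final Logger LOG = Logger.getLogger(GuardedJob.class.getName());

    private final Runnable delegate;
    private final AtomicReference<Throwable> lastFailure = new AtomicReference<>();

    GuardedJob(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            delegate.run();
            lastFailure.set(null); // last run succeeded
        } catch (RuntimeException e) {
            lastFailure.set(e);    // keep it so a health check can report it
            LOG.log(Level.SEVERE, "Scheduled job failed", e);
        }
    }

    /** Returns the exception from the most recent run, or null if it succeeded. */
    Throwable lastFailure() {
        return lastFailure.get();
    }
}
```

With this in place, a missing log entry is more likely to mean the job was never triggered at all, rather than that it died partway through.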
I checked the logs every time I noticed that the job hadn't run in the past few days. The reason I think the scheduler got deactivated is that there wasn't any output from the scheduler in the logs, and I was able to activate it again when I opened the configuration in the system configuration manager and clicked save, regardless of whether I changed anything. There is a chance the scheduler was deactivated whenever the bundle restarted. I will build the health check so I can monitor that it does not happen again.
If you were able to crash the scheduler, it's definitely worth a support ticket. Another possible root cause could be that all threads of the scheduler were exhausted (maybe by endless loops or blocked threads), but that should be easy to spot in a threaddump.
We had a similar issue with a replication listener that makes a call to a third-party API. Sometimes it takes more than 10 seconds to get a response back, and the listener stopped randomly. While debugging we noticed the error below:
org.apache.felix.eventadmin EventAdmin: Blacklisting ServiceReference [[org.osgi.service.event.EventHandler]
We found that the cause is the following:
"The Apache Felix Event Admin implementation is trying the deliver the events as fast as possible. Events sent from different threads are sent in parallel. Events from the same thread are sent in the order they are received (this is according to the spec). A timeout can be configured which is used for event handlers. If an event handler takes longer than the configured timeout to process an event, it is blacklisted. Once a handler is in a blacklist, it doesn't get sent any events anymore. The Felix Event Admin can be configured either through framework properties or through the configuration admin using PID org.apache.felix.eventadmin.impl.EventAdmin."
We tried increasing the timeout value in org.apache.felix.eventadmin.Timeout, but that didn't help either.
So what we did was move part of the event handling code to a dedicated thread pool, and it has never stopped since. Hope this helps.
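For anyone who wants to try the same approach, here is a minimal sketch of the idea in plain Java, not our actual code. OffloadingHandler, its handleEvent(String) method and the slowWork callback are hypothetical stand-ins; in a real bundle the hand-off would happen inside the handleEvent method of your org.osgi.service.event.EventHandler, and the pool would be shut down when the component is deactivated.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical handler: accepts events quickly, does slow work on its own pool. */
class OffloadingHandler {
    private final ExecutorService pool;
    private final Consumer<String> slowWork;

    OffloadingHandler(int threads, Consumer<String> slowWork) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.slowWork = slowWork;
    }

    /** Stand-in for the event callback: returns as soon as the task is queued. */
    void handleEvent(String path) {
        pool.submit(() -> {
            try {
                slowWork.accept(path); // e.g. the slow third-party API call
            } catch (RuntimeException e) {
                System.err.println("Processing failed for " + path + ": " + e);
            }
        });
    }

    /** Stops accepting work and waits for queued tasks; true if all finished in time. */
    boolean shutdown(long timeoutSeconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because the callback only queues a task, it returns well within the Event Admin timeout even when the third-party call is slow. The trade-off is that the pool's queue is unbounded in this sketch, so a long outage of the remote API would let tasks pile up.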