SOLVED

One Publisher node flapping issue


Hi All,

I am working on AEM 6.4, and our application runs on 1 Author and 3 Publisher nodes (P1, P2, P3).

Over the last couple of months, one of our Publisher servers has started flapping or become quite slow at times (for various reasons, such as lack of disk space or network issues), but it was never completely down. It kept trying to connect but could not, so we finally had to restart that Publisher to resolve the issue. It is not just one particular Publisher that behaves like this: across the last 3 incidents, two of the Publishers showed the same behavior.

During this interval, requests routed to the affected Publisher got stuck, and end users saw the page continually trying to connect.

If a Publisher is completely down, we can remove it from the load balancer. But how can we avoid the scenario where one of the Publishers is flapping and unable to serve requests? Please suggest how we can bypass the affected Publisher so that all requests go to the working Publishers.

 

Please suggest.

1 Accepted Solution


Hello there,

 

It could be happening for a broad range of reasons, so let's narrow these down a little. Do you see frequent spikes in CPU usage on your author? Perhaps some sort of replication is happening from your author that has gone unnoticed. The best way to find out is to check your replication agent logs. We recently ran into an issue where an old custom index (which was meant to be removed) was triggered accidentally, and this job was scheduled to trigger every 10 minutes. The replication agents were transmitting a huge amount of data, which completely overwhelmed our publishers. Even though the Java process seemed to be running on the servers where the publishers are hosted, the publishers had actually crashed.

 

Also, having some sort of health check in place might be helpful, so that a publisher is taken out of the load balancer if its check returns a certain status code.
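To illustrate the idea, here is a minimal sketch of the bookkeeping a load balancer might do per publisher. The class name `PublisherRotation`, the rule that only HTTP 200 counts as healthy, and the fall/rise thresholds are all assumptions made for illustration; real load balancers expose this as configuration rather than code. The thresholds are there so that a flapping publisher is not put back into rotation after a single lucky response.

```java
/**
 * Hypothetical per-publisher state, as a load balancer might track it.
 * Illustrative only; not an AEM or load balancer API.
 */
final class PublisherRotation {
    private final int fallThreshold; // consecutive failed checks before removal
    private final int riseThreshold; // consecutive good checks before re-adding
    private int failures = 0;
    private int successes = 0;
    private boolean inRotation = true;

    PublisherRotation(int fallThreshold, int riseThreshold) {
        this.fallThreshold = fallThreshold;
        this.riseThreshold = riseThreshold;
    }

    /**
     * Records one health-check status code and returns whether the
     * publisher should currently receive traffic.
     */
    boolean record(int statusCode) {
        boolean ok = statusCode == 200; // assumption: only 200 means healthy
        if (ok) {
            successes++;
            failures = 0;
            if (!inRotation && successes >= riseThreshold) {
                inRotation = true;
            }
        } else {
            failures++;
            successes = 0;
            if (inRotation && failures >= fallThreshold) {
                inRotation = false;
            }
        }
        return inRotation;
    }

    boolean isInRotation() {
        return inRotation;
    }
}
```

With `new PublisherRotation(2, 3)`, two consecutive non-200 checks take the publisher out, and it only comes back after three consecutive 200 responses.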

 

Hopefully this helps. Thanks!


2 Replies

Employee Advisor

You should not configure all publish instances statically in the load balancer. Instead, the load balancer should probe each publish instance and send requests to it only when the probe succeeds. That's a feature every load balancer has.

 

Regarding the probe: I would recommend using a simple HTTP(S) request that AEM (not the dispatcher!) can handle within a few milliseconds (in most cases a simple servlet should be sufficient). In case of network issues or massive slowness of the instance, the response will be slower, so the probe fails and the load balancer stops sending requests to that instance until the probes succeed again.
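As a rough sketch of what such a probe does on the load balancer side, the following uses the JDK's `java.net.http.HttpClient` (Java 11+) to send a GET with a hard timeout. The class name `PublisherProbe`, the probe URL, and the rule that only HTTP 200 counts as success are assumptions for illustration; in practice the load balancer's built-in health check would do this, and the target would be whatever lightweight servlet you expose on the publish instance.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical load-balancer-style probe; not part of AEM. */
final class PublisherProbe {

    private PublisherProbe() {}

    /** Assumption of this sketch: only a plain 200 counts as healthy. */
    static boolean isHealthyStatus(int status) {
        return status == 200;
    }

    /**
     * Returns true only if the URI answers with a healthy status within
     * the timeout. Slow, unreachable, or erroring instances yield false.
     */
    static boolean probe(URI uri, Duration timeout) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(timeout)
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(timeout) // a slow instance fails the probe
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return isHealthyStatus(response.statusCode());
        } catch (IOException e) {
            // Covers connection refused, network errors, and timeouts.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A call such as `PublisherProbe.probe(URI.create("http://p1.example.com:4503/bin/health"), Duration.ofMillis(500))` (host and path are placeholders) would then report a slow publisher as failed instead of waiting on it indefinitely.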

 

(This is not specific to AEM. And you really should find out why your instances show this behavior.)