SOLVED

Replication performance issues

Hi all,

We have noticed that when our CQ5 environment is up and running for a few days, replication performance degrades significantly.

When the system is healthy, at peak moments, there's 2MB/s data transfer observed in the network between the author/publisher instances.

However when the performance degrades, network traffic is reduced to 100-200KB/s.

Heap and PermGen memory usage on both publisher and author instances are within the normal range, GC runs are normal and quick, author JVM has less than 20% CPU usage and the publisher JVM reports 50% CPU usage.

We have also increased the Sling Eventing Thread Pool's size to 100 on author instances per recommendations we have received from DayCare.

A few questions:

  • Is there any way we can increase the number of threads that an agent uses to replicate data from an author instance to a publisher instance?
  • Are there any settings we can tune on publisher instances to improve performance?
  • What is the purpose of the "Batch" settings of a replication agent and how do they kick in? We have set the batch size to 10,000 and the delay to 60 seconds but they don't seem to have any kind of impact on performance?

Any other way we can mitigate this issue?

Thanks in advance.

1 Accepted Solution

Correct answer by
Level 10

LinearGradient wrote...

Hi all,

We have noticed that when our CQ5 environment is up and running for a few days, replication performance degrades significantly.

This points to a network issue; check with your network team.

When the system is healthy, at peak moments, there's 2MB/s data transfer observed in the network between the author/publisher instances.

However when the performance degrades, network traffic is reduced to 100-200KB/s.

It is the other way around: network slowness is making the replication jobs take longer to complete.

Heap and PermGen memory usage on both publisher and author instances are within the normal range, GC runs are normal and quick, author JVM has less than 20% CPU usage and the publisher JVM reports 50% CPU usage.

We have also increased the Sling Eventing Thread Pool's size to 100 on author instances per recommendations we have received from DayCare.

A few questions:

  • Is there any way we can increase the number of threads that an agent uses to replicate data from an author instance to a publisher instance?

Replication is FIFO, so increasing the number of threads does not make sense.

  • Are there any settings we can tune on publisher instances to improve performance?

Make sure network connectivity is good, and use a package if you are replicating a lot of content.

  • What is the purpose of the "Batch" settings of a replication agent and how do they kick in? We have set the batch size to 10,000 and the delay to 60 seconds but they don't seem to have any kind of impact on performance?

As the name indicates, it collects the pending replication requests and triggers them once a certain threshold is reached.
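
To make the threshold idea concrete, here is a minimal Java sketch of a "collect, then trigger" batcher. It is not the CQ5 replication agent's code: the class `BatchTriggerSketch`, its methods, and the rule "flush when either an item count or a wait time is reached" are all assumptions made for illustration. The real agent's trigger may be measured differently (for example by payload size rather than item count).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical batcher, not the CQ5 replication agent. */
public class BatchTriggerSketch {
    private final int maxItems;        // assumed "batch size" threshold (item count)
    private final long maxWaitMillis;  // assumed "batch delay" threshold
    private final List<String> pending = new ArrayList<>();
    private long firstItemAt = -1;     // arrival time of the oldest pending item

    public BatchTriggerSketch(int maxItems, long maxWaitMillis) {
        this.maxItems = maxItems;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Queues one item; returns the flushed batch if a threshold was hit, else an empty list. */
    public List<String> offer(String path, long nowMillis) {
        if (pending.isEmpty()) {
            firstItemAt = nowMillis;
        }
        pending.add(path);
        return flushIfDue(nowMillis);
    }

    /** Flushes everything pending, in arrival (FIFO) order, once either threshold is reached. */
    public List<String> flushIfDue(long nowMillis) {
        boolean sizeReached = pending.size() >= maxItems;
        boolean delayReached = !pending.isEmpty()
                && nowMillis - firstItemAt >= maxWaitMillis;
        if (!sizeReached && !delayReached) {
            return new ArrayList<>();
        }
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        return batch;
    }

    public static void main(String[] args) {
        // Small thresholds so the behaviour is visible: 3 items or 60 seconds.
        BatchTriggerSketch b = new BatchTriggerSketch(3, 60_000);
        System.out.println(b.offer("/content/a", 0));      // []
        System.out.println(b.offer("/content/b", 1_000));  // []
        System.out.println(b.offer("/content/c", 2_000));  // [/content/a, /content/b, /content/c]
        System.out.println(b.offer("/content/d", 3_000));  // []
        System.out.println(b.flushIfDue(63_000));          // [/content/d]
    }
}
```

Under these assumptions, a very large size threshold combined with a 60 second delay means the delay is what normally fires. Batching of this kind changes when items are sent, not how fast the link between author and publish can carry them.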

Any other way we can mitigate this issue?

Get official help by sending all logs and thread dumps from both the author and publish instances.

Thanks in advance.

 
