SOLVED

AEM 6.5 - Binary-less replication in shared S3 datastore

Level 2

Hi,

We have a customized AEM asset portal that uses a shared S3 datastore. When replicating assets in bulk (around 600 assets) from author to publish via binary-less replication, we observed that the publisher becomes unresponsive and starts returning the oops page (HTTP 500). The publisher recovers automatically once the number of items in the replication queue drops. For instance, within 30 seconds the author makes around 300-400 POST requests to the publisher to publish the assets. The publisher also sends flush requests to the dispatcher to clear its cache. Sometimes we have seen 1500+ items in the replication queue, which makes the publisher unresponsive.

 

How can we increase the throughput of binary-less replication?

Is there any way to replicate in chunks or batches?

How can we tune our publisher to handle bulk requests?

 

1 Accepted Solution

Correct answer by
Employee Advisor

Hi,

Your scenario is not really clear to me. I understand that you are replicating hundreds of assets in bulk and that during that time the publish instance is unable to respond to other incoming requests.

 

Let me comment on some of your statements:

  • within 30 seconds 400-500 POST requests: That makes me think that the replication itself is quite quick. How do you trigger the replication on author? Do you use the standard "replicator.replicate()" method, or do you use synchronous replication with many threads in parallel? The first one replicates only one asset at a time, while in the second case you will have multiple replication requests at once.
  • You should check the dispatcher logs and see what is happening. Take a few thread dumps on publish and check what the threads are doing. And what is the exception message of the "internal server error" requests you mentioned?
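
For the thread dumps, a common route is to run the JDK's `jstack` tool against the process ID of the publish JVM a few times in a row. If capturing them from inside a JVM is easier, the following is a minimal sketch using the standard `java.lang.management` API. The class name `ThreadDumpSketch` is made up for illustration and is not part of AEM, and the sketch only sees the JVM it runs in.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Illustrative helper, not part of AEM: captures a thread dump of the current JVM. */
final class ThreadDumpSketch {

    /** Returns one text block per live thread: name, state, and stack frames. */
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: leave out locked monitor / synchronizer details
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Take a few dumps some time apart; threads that sit in the same
        // frames across dumps are the ones worth a closer look.
        for (int i = 0; i < 3; i++) {
            System.out.println("=== dump " + (i + 1) + " ===");
            System.out.print(dump());
            Thread.sleep(1000);
        }
    }
}
```

Comparing several dumps taken while the publisher is unresponsive should show whether the request threads are all busy in the same place.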

If you can answer these questions we can help you further.

 

Jörg

 

3 Replies

Level 2

Hi Jörg,

 

Your understanding is correct. We have one author and one publish instance sharing an S3 datastore.

  • The author instance has a scheduler that runs every hour, pulls assets from an S3 bucket, processes them, and replicates them to publish using the "replicator.replicate()" method. This replication happens one asset at a time.
  • We'll analyse the dispatcher logs and share our findings.
  • This bulk replication worked fine on AEM 6.3. We started facing this issue after upgrading to AEM 6.5.
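
Regarding the earlier question about replicating in chunks: a scheduled job like the one described could pace itself roughly as sketched below. This is only a sketch under assumptions. The class `BatchedReplicationSketch`, the batch size, the pause, and the example paths are all made up, and the callback is a placeholder for wherever the job would call "replicator.replicate()". Whether pacing helps at all depends on the actual cause, which is not yet established in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative helper, not an AEM API: hands paths to a callback in fixed-size batches. */
final class BatchedReplicationSketch {

    /** Splits paths into consecutive batches of at most batchSize entries. */
    static List<List<String>> partition(List<String> paths, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int from = 0; from < paths.size(); from += batchSize) {
            int to = Math.min(from + batchSize, paths.size());
            batches.add(new ArrayList<>(paths.subList(from, to)));
        }
        return batches;
    }

    /**
     * Passes each batch to replicateBatch and waits pauseMillis between batches,
     * giving the receiving side time to drain its queue.
     */
    static void replicateInBatches(List<String> paths, int batchSize, long pauseMillis,
                                   Consumer<List<String>> replicateBatch)
            throws InterruptedException {
        List<List<String>> batches = partition(paths, batchSize);
        for (int i = 0; i < batches.size(); i++) {
            replicateBatch.accept(batches.get(i));
            if (i < batches.size() - 1) {
                Thread.sleep(pauseMillis); // no pause needed after the last batch
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> paths = new ArrayList<>();
        for (int i = 1; i <= 600; i++) {
            paths.add("/content/dam/example/asset-" + i); // made-up paths
        }
        // Placeholder callback; a real job would trigger replication here.
        replicateInBatches(paths, 50, 100,
                batch -> System.out.println("replicating " + batch.size() + " assets"));
    }
}
```

Keeping the batching logic separate from the replication call makes it easy to try different batch sizes and pauses without touching the rest of the job.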

 

Employee Advisor

Thanks for the confirmation. I doubt that the S3 datastore has anything to do with it in your case, because only the asset binary is stored in S3, and simply iterating over the nodes and reading metadata does not trigger any read from S3.

The fact that you do this via a single scheduled job makes me think that only a single thread is involved, which should not cause any problem on the publish side. The upgrade from 6.3 to 6.5 shouldn't cause this either, but it does make me suspect that some configuration went wrong during the upgrade. Can you check that the Asset Update workflow is not triggered on publish? Also, are you sure that the load pattern has not changed between 6.3 and 6.5? Otherwise you may be looking for issues in the code or configuration on 6.5 while the actual problem is the (amount of) data you are processing, which would have affected 6.3 in the same way it affects 6.5 now.

 

Jörg