SOLVED

S3 DataStore growth issue

Level 4

We have been using S3 for AEM data storage since AEM 5.6.1 (about two years now). Currently we are on AEM 6.2 SP2. We have a single author (with a cold standby), and a large number of publishers (around 26 AEM instances) are connected to a single S3 bucket.

Before the migration to 6.2, the total S3 size was 2 TB; now, in the past year or so, it has reached 90+ TB (March 2018), with an average growth of 300 GB/day. The interesting thing is that our editors upload at most 1 GB of assets per day, yet S3 grows by 300 GB/day.

  1. All replications are configured to be "binaryless", and we can confirm from the logs that the replicated content is far smaller than its actual size.
  2. We perform DataStore GC every 6 months. In the last GC run, the size dropped from 75 TB to 61 TB, but the average S3 growth is still 300 GB/day. We also perform periodic cleanups of audit logs, version purge, and workflow purge (on a weekly basis, along with tar compaction).
  3. We had versioning enabled on S3, with a rule to remove all deleted versions older than 30 days. We have since disabled versioning, but no luck (a sketch of how such a rule can be set follows this list).
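For reference, a lifecycle rule like the one in point 3 can be set programmatically. This is only a minimal sketch using the v1 AWS SDK for Java, with a placeholder bucket name; it is an assumption of what such a 30-day cleanup rule could look like, not our exact configuration:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;

public class LifecycleSetup {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Expire noncurrent (deleted or overwritten) object versions after 30 days.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
                .withId("expire-noncurrent-versions")
                .withPrefix("") // apply to the whole bucket
                .withNoncurrentVersionExpirationInDays(30)
                .withStatus(BucketLifecycleConfiguration.ENABLED);

        // "my-aem-datastore-bucket" is a placeholder bucket name.
        s3.setBucketLifecycleConfiguration("my-aem-datastore-bucket",
                new BucketLifecycleConfiguration().withRules(rule));
    }
}
```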
1 Accepted Solution

Correct answer by
Level 4

Hi,

Our company has also suffered from the same problem. We have a shared S3 bucket (1 author, 4 publishers), and when we started off it was a little less than 1 TB in size, but over a few weeks it started growing rapidly, up to a whopping 5 TB (a 5x increase).

The growth was out of all proportion to the amount of binaries and pages we created in AEM, even with all other maintenance procedures in place.

We automated the data store GC procedure by calling the JMX bean on every instance with the markOnly option, and then once more on our primary author instance without it.

Check the Data Store Garbage Collection documentation on how to achieve that.
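For illustration, here is a minimal sketch of what that automation could look like over remote JMX. The hosts and ports are placeholders, and the ObjectName pattern is an assumption based on the Oak documentation, so verify both in your own JMX console first:

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DataStoreGcRunner {

    // Placeholder JMX endpoints; replace with your publisher and author instances.
    private static final String[] MARK_ONLY_INSTANCES = {"publish1:9010", "publish2:9010"};
    private static final String AUTHOR_INSTANCE = "author1:9010";

    public static void main(String[] args) throws Exception {
        // Mark phase on every instance sharing the datastore ...
        for (String hostPort : MARK_ONLY_INSTANCES) {
            runBlobGc(hostPort, true); // markOnly = true
        }
        // ... then the full mark + sweep on the primary author.
        runBlobGc(AUTHOR_INSTANCE, false); // markOnly = false
    }

    private static void runBlobGc(String hostPort, boolean markOnly) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Look up the Oak blob GC MBean; the exact ObjectName differs per Oak
            // version, so this pattern is an assumption -- check your JMX console.
            Set<ObjectName> names = conn.queryNames(
                    new ObjectName("org.apache.jackrabbit.oak:type=BlobGarbageCollection,*"),
                    null);
            for (ObjectName name : names) {
                conn.invoke(name, "startBlobGC",
                        new Object[] {markOnly},
                        new String[] {"boolean"});
            }
        }
    }
}
```

The mark phase records the blob references each instance still holds; only after every instance sharing the datastore has marked its references is the sweep on the author safe to delete unreferenced binaries.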

Our automated procedure always happily reported that the datastore GC was successful, but our S3 bucket kept growing every day, at an ever faster pace.

We noticed that the JMX call doesn't report it when the procedure fails!

So check your error.log for INFO/WARN/ERROR messages about the datastore GC procedure. In our case it was something like "Not all repositories have marked references available ..."

So the data stored in the S3 bucket below the META folder, which tracks the instances in the shared datastore setup, was incorrect, causing the cleanup procedure to do nothing at all.

But once that data was corrected, the S3 bucket size started to drop rapidly. At the moment we are back to about 1.4 TB.

(We also used S3 versioning, configured to keep 2 weeks of deleted data, so it took 2 weeks before we noticed the first drop in S3 size.)

Hint: If you are on Oak 1.6.x, use the newer datastore GC JMX bean as documented on Jackrabbit Oak – The Blob Store (MarkSweepGarbageCollector#collectGarbage(boolean markOnly, boolean forceBlobRetrieve)). Setting forceBlobRetrieve=true makes the procedure execute much faster on a datastore repository as big as yours.
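If a two-argument startBlobGC operation is exposed on the MBean in your Oak version (I'm assuming it is on Oak 1.6.x, but confirm the operation name and signature in your JMX console first), the invoke call in the sketch above could be adapted like this:

```java
// Hypothetical two-argument variant of the invoke call from the earlier sketch;
// confirm that your Oak 1.6.x MBean actually exposes this signature.
conn.invoke(name, "startBlobGC",
        new Object[] {Boolean.FALSE, Boolean.TRUE}, // markOnly=false, forceBlobRetrieve=true
        new String[] {"boolean", "boolean"});
```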

Hint 2: run your datastore GC more frequently. It takes much less time if you do. We now run it every day (taking about 1h).

Hope this helps!

Kind regards

Wim

3 Replies

Employee

Can you switch on the debug logs to understand what's getting into the datastore at this fast pace?

org.apache.jackrabbit.aws.ext.ds

org.apache.jackrabbit.oak.plugins.blob
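A minimal sketch of what such a logger could look like as a Sling LogManager factory configuration, assuming an OSGi config file such as org.apache.sling.commons.log.LogManager.factory.config-datastore.config (the file name suffix and log file path are placeholders; adjust to your setup):

```
org.apache.sling.commons.log.level="debug"
org.apache.sling.commons.log.file="logs/datastore.log"
org.apache.sling.commons.log.names=["org.apache.jackrabbit.aws.ext.ds","org.apache.jackrabbit.oak.plugins.blob"]
```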
