SOLVED

Entire repo in crx-quickstart/segmentstore (tar files) vs items over 4k in crx-quickstart/datastore


Level 4

When we installed AEM 6.1, the default configuration stored everything in crx-quickstart/segmentstore.

Our segmentstore now has over 3000 tar files and is roughly 120 GB.

Compaction, if not done frequently, can take several hours, depending on the hardware.

I'm working to change the repository configuration to store blobs over 4 KB on the filesystem in crx-quickstart/repository/datastore.
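The change I'm planning is a FileDataStore OSGi config dropped into crx-quickstart/install, along the lines of the sketch below (path and minRecordLength are the documented FileDataStore properties; the values are just my plan, not something prescribed):

    File: crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config

    path="./repository/datastore"
    minRecordLength="4096"

With minRecordLength="4096", binaries up to 4 KB stay inline in the tar segments and anything larger is written out to the datastore folder.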

I would like to confirm this is a better practice than having everything in tar files.

A benefit of keeping everything in tar files may be a single central location for everything and, more or less, a single maintenance task (compaction)?

A separate datastore seems like it might perform better, since binaries are read straight from the filesystem rather than out of the tar files? A separate datastore also seems to be the default configuration in later AEM versions?

A possible downside to a split repo would be the need to run a separate, manual garbage collection process for maintenance?

Please advise on which is the best practice, and why.

Thanks!


6 Replies


Community Advisor

I believe Adobe still recommends monthly off-line maintenance. This monthly planned downtime is a very good idea for two reasons:

1) Off-line tar compaction is still better at reclaiming wasted storage than its on-line counterpart (a typical oak-run sequence is sketched below).

2) The stop/start cycle will expose repository integrity issues that otherwise crop up only when AEM is restarted (at the most inopportune time!).
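For anyone who hasn't scripted it yet, the off-line run is typically a sequence like this with the oak-run tool (a sketch; the oak-run version must match your instance's Oak version, and the path assumes a default install):

    # AEM must be stopped before off-line compaction
    java -jar oak-run-<version>.jar checkpoints crx-quickstart/repository/segmentstore
    java -jar oak-run-<version>.jar checkpoints crx-quickstart/repository/segmentstore rm-unreferenced
    java -jar oak-run-<version>.jar compact crx-quickstart/repository/segmentstore

Removing unreferenced checkpoints first matters because compaction cannot reclaim segments that a stale checkpoint still references.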



Adobe Champion

Judging from past experience, I would suggest starting small before making major repo changes to tackle the issue.

What version of AEM are you now using? Still 6.1?

Indeed, using a datastore for blobs is the suggested way (we often switched to an S3 data store in the past).

Are you using the OOTB maintenance tasks, e.g. the revision and workflow cleanup tasks? Do you perform cleanup of older versions?


Level 4

@KimonP @Manu_Mathew_

Thanks for replying, this is indeed for a 6.1 instance, with everything currently in a tar file segmentstore.

It sounds like the answer to my question (according to kimonp31365843) is that a datastore for blobs (vs. everything in the segmentstore) is preferred, i.e. "Indeed, using a datastore for blobs is the suggested way (we often switched to an S3 data store in the past)."

I gave a few reasons (for and against) why I also thought a blob datastore would be preferred. Can anyone confirm or expand on why a separate blob datastore is preferred vs. everything in a tar file segmentstore?

S3 sounds like a great option, as well as possibly MongoDB, but we're not considering these at the moment.

Currently we're only doing offline compaction, and the repo is all in the segmentstore (tar files). We're considering splitting the repo to follow best practice, to hopefully increase performance, and to make maintenance more manageable.

Thanks!


Correct answer by
Level 6

Hi @this-that-the-otter,

As per the Adobe documentation on this:

"When dealing with large number of binaries, it is recommended that an external data store be used instead of the default node stores in order to maximize performance.

For example, if your project requires a large number of media assets, storing them under the File or S3 Data Store will make accessing them faster than storing them directly inside a MongoDB.

The File Data Store provides better performance than MongoDB, and Mongo backup and restore operations are also slower with large number of assets."

Storage-wise, AEM with a shared S3 datastore will give you good savings. Plus, configuring binary-less replication will make replication faster, since the payload doesn't have to be transferred over the network; only a pointer has to be updated with each replication event.
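For example, with a shared datastore on both author and publish, binary-less replication is just a parameter on the replication agent's transport URI (host and port here are placeholders):

    http://publish-host:4503/bin/receive?sling:authRequestLogin=1&binaryless=true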

A point of caution would be to disable the OOTB garbage collection, as it may lead to unexpected blob deletion; instead, you would need to use the markOnly option for future runs of garbage collection. One more problem from experience is that the repository does become inconsistent at times, so you would need to know how to do a datastore consistency check and restore missing blobs.
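For the consistency check, oak-run has a datastorecheck command; against a default filesystem setup a run looks roughly like this (a sketch; option names vary between oak-run versions, so check the tool's help first):

    # run with AEM stopped, pointing at the segmentstore and the datastore config
    java -jar oak-run-<version>.jar datastorecheck --consistency \
      --store crx-quickstart/repository/segmentstore \
      --fds crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config \
      --dump /tmp/datastore-check

The dump directory then lists the ids of blobs that are referenced in the repository but missing from the datastore, which is the starting point for restoring them from backup.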

https://experienceleague.adobe.com/docs/experience-manager-65/deploying/deploying/data-store-config....

https://experienceleague.adobe.com/docs/experience-manager-64/administering/operations/data-store-ga...

Thanks,

Ram

Level 4

Hi Ram,

Thanks for your reply. We split the all-in-segmentstore tar file repo into tar + datastore a few weeks ago, and things seem to be working fine.

For maintenance, I now typically do offline tar (revision cleanup) compaction, starting with removal of unreferenced checkpoints, as I had done with the all-in-segmentstore tar repo, but it tends to go quicker now with fewer tar files since the repo split.

Immediately following the tar compaction, I do datastore garbage collection, triggered via the JMX / repository / garbage collection (delete blobs true) console.
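In case anyone wants to script that step: the console button maps to the startDataStoreGC operation on Oak's RepositoryManagement MBean, which can also be invoked with a JMX client such as jmxterm (a sketch; the port assumes remote JMX was enabled with -Dcom.sun.management.jmxremote.port=9010, and the exact bean name is best confirmed with jmxterm's beans command):

    java -jar jmxterm-1.0.2-uber.jar -l localhost:9010
    # inside the jmxterm session:
    bean org.apache.jackrabbit.oak:name="repository manager",type=RepositoryManagement
    # false = full mark and sweep (deletes blobs); true = markOnly
    run startDataStoreGC false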

Regarding:

"A point of caution would be to disable the OOTB garbage collection, as it may lead to unexpected blob deletion; instead, you would need to use the markOnly option for future runs of garbage collection. One more problem from experience is that the repository does become inconsistent at times, so you would need to know how to do a datastore consistency check and restore missing blobs."

Is this something I need to consider in AEM 6.1? I'm not sure if there are any garbage collection jobs scheduled; I've never seen any mention of GC in the logs (except for tar GC) until I ran GC manually after the repo split.

I'm using the delete option for garbage collection, running it immediately after revision cleanup / tar compaction.

We're not using S3 or Mongo; each author and publish instance has its own filesystem-based repo.