SOLVED

AEM segmentstore vs datastore

Level 4

Hi,

In Adobe AEM, which content is saved in the segmentstore and which is saved in the datastore?

What is the difference between the two?

Thank you

1 Accepted Solution

Correct answer by
Level 3

Hey Roberto - 

In overly simplistic terms, the segmentstore contains the metadata for every node in the repository, including the content tree itself and the textual information on each node. The datastore holds the larger objects that need to be stored in AEM, which could be text data or binaries.

 

A base configuration in Jackrabbit Oak sets the minimum size at which an object is stored in the datastore rather than the segmentstore; I believe it still sits at 16KB. So, let's say you add a page node to AEM and put 2KB of text in the page: that entire object lives in the segmentstore. But let's say you upload a 100KB PDF. In that case, the metadata about the PDF (its title, description, JCR properties, location, tags, etc.) is physically stored in the segmentstore, while the binary data of the PDF itself is stored in the datastore, with only pointers in the segmentstore to where to find it.
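
To make the size rule concrete, here is a minimal sketch in Python. It is a toy model, not the Oak API: the function name `store_for` and the constant are made up for illustration, and the 16KB cut-off is simply the figure quoted above, which may differ in your configuration.

```python
# Toy model only: hypothetical names, not the Jackrabbit Oak API.
# The 16 KB cut-off is the figure quoted in the post; treat it as an assumption.
INLINE_THRESHOLD_BYTES = 16 * 1024


def store_for(size_bytes: int, threshold: int = INLINE_THRESHOLD_BYTES) -> str:
    """Return which store would hold a value of this size in the toy model.

    Values below the threshold stay inline in the segmentstore; values at or
    above it go to the datastore, leaving only a pointer in the segmentstore.
    """
    return "segmentstore" if size_bytes < threshold else "datastore"


# The two cases from the post: 2 KB of page text and a 100 KB PDF binary.
examples = {
    "2 KB page text": 2 * 1024,
    "100 KB PDF binary": 100 * 1024,
}
for label, size in examples.items():
    print(f"{label}: {store_for(size)}")
```

In this sketch the page text lands in the segmentstore and the PDF binary in the datastore; the PDF's metadata would still be a small value and stay in the segmentstore.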

 

This is why there are two different maintenance jobs in AEM: one to clean up the segmentstore, the other to clean up the datastore. If, let's say, the reference to that PDF is deleted in AEM, it only gets flagged for deletion in the segmentstore and is still on disk. The revision cleanup job can then reclaim that disk space from the segmentstore when it runs, but the datastore still contains the binary data for that PDF until the datastore cleanup job runs and removes any now-unreferenced objects from the datastore.
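
The interplay of the two jobs can be sketched with a small toy model in Python. Everything here is hypothetical (the `ToyRepository` class and its method names are not Oak or AEM APIs), and it ignores details such as retention periods; it only illustrates why a binary survives until both cleanups have run.

```python
class ToyRepository:
    """Toy model of the two stores; hypothetical names, not the Oak API."""

    def __init__(self):
        self.head = {}       # current content tree: node path -> blob id
        self.revisions = []  # old revisions still sitting in the segmentstore
        self.datastore = {}  # blob id -> binary payload

    def upload(self, path, blob_id, payload):
        # The binary goes to the datastore; the segmentstore keeps a pointer.
        self.datastore[blob_id] = payload
        self.head[path] = blob_id

    def delete(self, path):
        # Deleting writes a new revision; the old one still points at the blob.
        self.revisions.append(dict(self.head))
        del self.head[path]

    def revision_cleanup(self):
        # Segmentstore job: drop old revisions, return how many were removed.
        removed = len(self.revisions)
        self.revisions.clear()
        return removed

    def datastore_gc(self):
        # Datastore job, mark phase: collect every blob id still referenced
        # from the head or from any old revision that has not been cleaned up.
        referenced = set(self.head.values())
        for revision in self.revisions:
            referenced.update(revision.values())
        # Sweep phase: remove unreferenced blobs and report what was removed.
        garbage = [blob_id for blob_id in self.datastore if blob_id not in referenced]
        for blob_id in garbage:
            del self.datastore[blob_id]
        return garbage


repo = ToyRepository()
repo.upload("/content/dam/report.pdf", "blob-1", b"%PDF...")
repo.delete("/content/dam/report.pdf")

print(repo.datastore_gc())      # [] : an old revision still references blob-1
print(repo.revision_cleanup())  # 1  : the old revision is dropped
print(repo.datastore_gc())      # ['blob-1'] : the binary is now unreferenced
```

In this model, running the datastore cleanup before the revision cleanup frees nothing, because the old revision still holds a pointer to the binary.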

 

Hope that helps! 

6 Replies

Community Advisor

Hi @robertol6836527,

In Adobe Experience Manager (AEM), the content repository is divided into two main storage areas: the segment store and the data store. They handle different types of data:

Segment store: This stores the content and properties of your AEM pages. It essentially holds the metadata that describes your content. AEM uses a segment store implementation called TarMK by default, which stores this data in TAR files.

Data store: This stores the binary data associated with your content, such as images, videos, and documents. Data stores are separate from the segment store to improve performance and scalability. AEM can use various data store options, including a default file system data store or external options like Amazon S3.

 

Thanks,

Somen

Administrator

@robertol6836527 Did you find the suggestions from users helpful? Please let us know if more information is required. Otherwise, please mark the answer as correct for posterity. If you found a solution yourself, please share it with the community.



Kautuk Sahni

Level 1

Hello,

Could you also advise on the following?

In practice, should the full revision GC run before the datastore GC, or after?

Let's say my revision cleanup (which maintains the segmentstore) runs daily, with a full GC on Sunday. Should the weekly datastore GC then run on Monday or on Friday?

Does this order affect disk usage and AEM performance during the week?

Level 3

You want to run the revision GC (i.e. tar compaction) before the datastore GC. The datastore GC depends on the revision GC to know which blobs are no longer referenced and can be removed. A datastore GC run by itself (i.e. without any revision GC) won't reclaim anything. So, if you run a full revision GC on Saturday, you could then run your datastore GC on Sunday to take advantage of the earlier cleanup.