Garbage Collection versus Tar optimization

Level 9

http://www.cqtutorial.com/courses/cq-admin/cq-admin-lessons/cq-maintenance/cq-tar-optimization

The link above says, "All data (Except datastore data) is persisted in tar file in CQ".

I have the following questions:

1] Does this mean that Data Store GC and Tar PM optimization are two different things?

2] Is it possible to include Data Store GC in the Tar PM optimization process?

Any thoughts, links, or references on the above would be helpful.

5 Replies

Level 9

Hi All,

Also, the questions above apply to AEM version 5.6.1 and above.

Level 6

Yes, they are two different things.

The Data Store GC purges unreferenced data and temporarily created binary data from the Data Store. The Data Store contains all the large binaries. Any property, including binaries, whose size exceeds the threshold is stored in the Data Store, and the reference to it is stored in the persistence manager (Tar PM in this case). This makes it possible to share the Data Store across multiple instances, which avoids copying and heavy disk usage. An image is an image regardless of who reads it.
The DS GC takes care of cleaning out the items that do not have a reference from the repository.
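
To make the split between inline values and Data Store references concrete, here is a minimal Python sketch. It is a toy model, not CRX code: the `Repository` class, the 100-byte threshold, and the use of a SHA-1 digest as the identifier are all assumptions chosen for illustration.

```python
import hashlib


class Repository:
    """Toy model (hypothetical): small values stay inline, large ones go to a store."""

    def __init__(self, data_store, threshold=100):
        self.data_store = data_store  # dict: identifier -> bytes, may be shared
        self.threshold = threshold    # assumed size limit in bytes
        self.properties = {}          # path -> ("inline", bytes) or ("ref", identifier)

    def set_property(self, path, value: bytes):
        if len(value) < self.threshold:
            # Small value: kept with the node data (the persistence manager's job).
            self.properties[path] = ("inline", value)
        else:
            # Large value: stored once under a content-derived identifier,
            # and only the identifier is kept with the node.
            identifier = hashlib.sha1(value).hexdigest()
            self.data_store.setdefault(identifier, value)
            self.properties[path] = ("ref", identifier)

    def get_property(self, path):
        kind, payload = self.properties[path]
        return payload if kind == "inline" else self.data_store[payload]


# Two instances sharing one store keep a single copy of the same image.
store = {}
author = Repository(store)
publish = Repository(store)
image = b"x" * 5000
author.set_property("/content/dam/logo.png", image)
publish.set_property("/content/dam/logo.png", image)
author.set_property("/content/title", b"Home")
```

In this sketch the shared `store` ends up with one record even though two instances reference the image, which is the disk saving described above.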

Tar PM optimization is basically the same, but for the repository. They are two different things. We normally run the persistence optimization first and then the DS GC. That keeps disk usage at a manageable size.

Doing them both at the same time is not a good idea. The reason is that if I run Tar PM optimization on Author with a shared DS and evict the data items from the DS before I have run Tar PM optimization on my Publish, I will get inconsistencies. If you have a shared-nothing approach, then it does not matter.
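
The risk with a shared Data Store can be illustrated with a small Python sketch. It is only a model under stated assumptions: the dictionaries, the identifiers `img-1` and `img-2`, and the `sweep` helper are hypothetical and do not correspond to any CRX API.

```python
def referenced_ids(repository):
    """Collect the Data Store identifiers one repository still points at."""
    return {value[1] for value in repository.values() if value[0] == "ref"}


def sweep(data_store, repositories):
    """Delete every record that none of the given repositories reference."""
    live = set()
    for repository in repositories:
        live |= referenced_ids(repository)
    removed = sorted(set(data_store) - live)
    for identifier in removed:
        del data_store[identifier]
    return removed


shared_store = {"img-1": b"...", "img-2": b"..."}
# a.png has already been deleted on Author but is still live on Publish.
author = {"/content/dam/b.png": ("ref", "img-2")}
publish = {
    "/content/dam/a.png": ("ref", "img-1"),
    "/content/dam/b.png": ("ref", "img-2"),
}

# Sweeping with only Author's view removes a binary that Publish still needs.
unsafe = sweep(dict(shared_store), [author])
# Sweeping with every instance's references leaves both binaries in place.
safe = sweep(dict(shared_store), [author, publish])
```

Here `unsafe` contains `img-1`, the binary Publish still references, while `safe` is empty. With separate Data Stores per instance the problem cannot arise, since each sweep only touches its own store.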

/Ove

Level 9

Hi Ove,

Thanks a lot for your detailed explanation.

I have a few more questions on this.

I was going through the link below

http://dev.day.com/docs/en/crx/current/administering/persistence_managers.html#Tar PM Optimization Case Study

and, in the section "DATA STORE GARBAGE COLLECTION", found the following lines:

a] "it is beneficial to get rid of them to preserve space and to optimize backup and filesystem maintenance performance"

b] "removal of garbage records does not affect normal performance, so this is not a performance optimization"

1] I understand how the space is preserved. However, I am not clear about the other things mentioned in point a]. Some explanation would be helpful. Does it mean that if author 'A' is backed up onto a standby author 'B' by a daily copy operation, running Data Store GC will help?

2] I was surprised by point b].

3] I navigated to the path crx-quickstart->Repository->Repository->DataStore in my local OOTB instance.

   I expected to find DAM assets there, probably images of various formats, but I could not find any.

4] Also, I see two indexes in the following locations (local OOTB instance):

    C:\cq5\author\crx-quickstart\repository\repository\index

    C:\cq5\author\crx-quickstart\repository\workspaces\crx.default\index

    What is the difference between these two?

Any thoughts on the above will be helpful.

Employee Advisor

Hi,

To your questions:

1) DSGC is always beneficial, as it reduces the disk space needed for a datastore. If you do a full copy of instance A to another place, it will reduce the amount of disk space required for the copy.

2) The Datastore is just a bunch of files in a special structure and with some special naming. It has no index of its own. The persistence manager has links to specific files in this datastore, but the datastore has no links back. So when all references to a file in the datastore are removed, the datastore does not know about this, and it won't remove that file. This is the job of the DSGC.
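
Because the datastore has no links back, a collector has to walk the repository to find out what is still in use. The Python sketch below shows one way this can work, a mark phase that "touches" every referenced record followed by a sweep of everything left untouched. It is a simplified model: the `DataStore` class, its logical clock, and `collect_garbage` are hypothetical names, and the real DSGC may differ in detail.

```python
class DataStore:
    """Toy store (hypothetical): remembers only when each record was last touched."""

    def __init__(self):
        self.records = {}  # identifier -> tick of last touch
        self.clock = 0     # logical clock standing in for file timestamps

    def tick(self):
        self.clock += 1
        return self.clock

    def add(self, identifier):
        self.records[identifier] = self.tick()

    def touch(self, identifier):
        if identifier in self.records:
            self.records[identifier] = self.tick()

    def delete_older_than(self, cutoff):
        stale = sorted(i for i, t in self.records.items() if t < cutoff)
        for identifier in stale:
            del self.records[identifier]
        return stale


def collect_garbage(store, repository_references):
    scan_start = store.tick()
    # Mark: walk every reference the repository holds and touch its record.
    for identifier in repository_references:
        store.touch(identifier)
    # Sweep: anything not touched since the scan started has no reference left.
    return store.delete_older_than(scan_start)


store = DataStore()
for name in ("a", "b", "c"):
    store.add(name)
removed = collect_garbage(store, ["a", "c"])  # "b" is no longer referenced
```

In this sketch a record added while the scan is running gets a tick later than `scan_start`, so it survives the sweep even if the mark phase never saw it.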

3) You'll find the binaries there, but not with the names you are looking for. This is an internal structure of the repository, and it has its own naming.
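
As an illustration of why the original file names do not show up, here is a Python sketch of a content-addressed layout. It assumes a scheme in which the file name is the SHA-1 digest of the content and the first three byte pairs of the digest become nested directories; that is an assumption about the internal structure, and `record_path` is a hypothetical helper, not a repository API.

```python
import hashlib
from pathlib import PurePosixPath


def record_path(root: str, content: bytes) -> PurePosixPath:
    """Derive a storage path from the content itself, not from the asset name."""
    digest = hashlib.sha1(content).hexdigest()
    # Assumed layout: <root>/<aa>/<bb>/<cc>/<full digest>
    return PurePosixPath(root, digest[0:2], digest[2:4], digest[4:6], digest)


# Two assets with different names but identical bytes map to the same file.
same_a = record_path("datastore", b"identical image bytes")
same_b = record_path("datastore", b"identical image bytes")
```

Under this assumption, an asset uploaded as `logo.png` would appear on disk only as a long hexadecimal file name inside nested two-character directories.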

4) C:\cq5\author\crx-quickstart\repository\repository\index contains the index for the version workspace, C:\cq5\author\crx-quickstart\repository\workspaces\crx.default\index the index for the crx.default workspace. The locations are not consistent, but you can find the configuration for it in your repository.xml.

HTH,
Jörg

Level 9

Hi Jorg,

Thanks a lot for your detailed explanation.