Can you identify valid blob ids in File Data Store and remove unused files?

Level 5

Marking each AEM instance for file datastore GC is one thing.  But I'm wondering if there is a way, maybe with oak-run that I don't know about, that can do the reverse: get all the blob ids in the File Data Store and check against each AEM instance and mark any unused files.

For example, if I manually created a file dog.txt directly in the FDS, is there a command I can run that would flag it as never used by any of my AEM instances?

I have an externally shared FDS using binary-less replication, with 1 author and 3 publish instances. I feel like I'm about 150G above what I should be.

 

10 Replies

Community Advisor

Hi @sdouglasmc ,

I created a script to clean up unused references, shown below. Customize it as per your requirements.

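#!/bin/bash
# Run from the AEM install directory with the instance stopped (offline maintenance).
# Feed cleanup.commands (not shown in this post) to the oak-run console.
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar console crx-quickstart/repository/segmentstore < cleanup.commands
# List the existing checkpoints.
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar checkpoints crx-quickstart/repository/segmentstore
# Remove checkpoints that are no longer referenced.
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar checkpoints crx-quickstart/repository/segmentstore rm-unreferenced
# Remove all remaining checkpoints. Note: this forces a full reindex of the async indexes on the next start.
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar checkpoints crx-quickstart/repository/segmentstore rm-all
# Offline compaction (revision cleanup) of the segment store.
java -Xmx1g -Doak.compaction.eagerFlush=true -Doffline-compaction=true -jar tools/oak-run-1.40.0.jar compact crx-quickstart/repository/segmentstore
# Delete the backup tar files left behind by compaction (-f: no error if there are none).
rm -f crx-quickstart/repository/segmentstore/*.tar.bak
echo "Finished"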

rmNode.groovy

import org.apache.jackrabbit.oak.spi.commit.CommitInfo
import org.apache.jackrabbit.oak.spi.commit.EmptyHook
import org.apache.jackrabbit.oak.spi.state.NodeStateUtils
import org.apache.jackrabbit.oak.spi.state.NodeStore
import org.apache.jackrabbit.oak.commons.PathUtils

// Removes the node at 'path', or only its children when includingThis is false.
def rmNode(def session, String path, boolean includingThis = true) {
    if (!includingThis) {
        println "Removing subnodes of ${path}"

        def ns = NodeStateUtils.getNode(session.getRoot(), path)
        for (def subNodeName : ns.getChildNodeNames()) {
            // Keep the access control policy node of the parent.
            if (!subNodeName.equals("rep:policy")) {
                String subpath = path + "/" + subNodeName
                rmNode(session, subpath)
            }
        }
    } else {
        println "Removing node ${path}"

        NodeStore ns = session.store
        def nb = ns.root.builder()

        // Walk down from the root builder to the builder of the target node.
        def aBuilder = nb
        for (p in PathUtils.elements(path)) {
            aBuilder = aBuilder.getChildNode(p)
        }
        if (aBuilder.exists()) {
            def rm = aBuilder.remove()
            // Persist the removal without running any commit hooks.
            ns.merge(nb, EmptyHook.INSTANCE, CommitInfo.EMPTY)
            return rm
        } else {
            println "Node ${path} doesn't exist"
            return false
        }
    }
}

Create a tools folder next to your script and put the rmNode.groovy above and the oak-run jar from here in it.

Hope that helps!

Regards,

Santosh

Level 5

Appreciated, but that isn't what I'm looking for, given the way we have things set up.

Employee Advisor

Why does the datastore GC process not work for you? I am not aware of any other way to do it.

Level 5

It does work.  That's not the issue I'm having.  I notice that we have spiked about 150G somewhere in the past 2 months and I'm trying to figure out why.  So, I'd like to figure out a way I can validate all the blobids in the FDS against the repositories and not the other way around.  Do you know if the marksweep uses any indexes or just the blobid cache on the filesystem?

Employee Advisor

There is no other way; you need to run the Datastore GC to identify all used blobs and then let it clear all unused ones.

Level 5

I guess my question would then be a bit more general... How does it compile a list of unused ones?

I typically run mark-sweep = true on all 3 publish instances and author.  I then run mark-sweep= false on author to clean it all up.  There must be a list consolidated somewhere (DS maybe? cached?) that has a list to remove - because it won't remove just anything.  As in, if I create a file in the DS itself, it doesn't get removed.  

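As a rough mental model of how that list comes about (a sketch based on my understanding of Oak's mark-and-sweep GC, not its actual code): each repository's mark run records the blob ids it still references, and the sweep deletes only blobs that no repository references and that are older than a maximum age. Blob, sweep_candidates and the 24-hour default below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blob:
    blob_id: str
    last_modified: float  # epoch seconds of the file in the data store


def sweep_candidates(blobs, references_per_repo, now, max_age_seconds=24 * 3600):
    """Return ids of blobs that a sweep would be allowed to delete.

    references_per_repo maps a repository name (hypothetical, e.g. "author")
    to the set of blob ids that repository marked as still in use.
    """
    # A blob is in use if ANY repository sharing the data store references it.
    referenced = set().union(*references_per_repo.values())
    # Blobs newer than the cutoff are kept even when unreferenced, so that
    # uploads still in progress are not deleted.
    cutoff = now - max_age_seconds
    return sorted(
        b.blob_id
        for b in blobs
        if b.blob_id not in referenced and b.last_modified < cutoff
    )
```

Under this model, whether a hand-made file is swept depends on whether the GC lists it as a blob at all and on its age; I have not verified how the File Data Store treats a file like dog.txt that does not follow its naming scheme.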

Employee Advisor

So when you execute the dataStoreGC it does not clean up any binaries? Is this the reason why you want to find out the unused binaries? In that case I would raise a support ticket first.

 

Do you regularly execute the revision cleanup on all nodes? This is a prerequisite for the datastoreGC to be able to remove binaries. Otherwise old revisions may still hold references into the datastore. Those references are no longer in use and would be removed by the Revision Cleanup (aka "compaction"), but the datastoreGC mark phase cannot detect them as "old", so it treats the binaries as still referenced.

 

 

Level 5

It does clean up binaries.  It works just fine.  We do nightly revision cleanup on all 4 (1 author, 3 publish).

It just seems really odd we are sitting at about 620G shared when our author instance is around 320G itself.  I'm just looking to find a way of relating the blob ids back to their related AEM instance.  Obviously, I would expect 95% of the blob ids in the publish instances to be shared among all the instances.

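One way to relate blob ids back to instances, assuming you can dump one text file with every blob id in the FDS and one reference list per instance (as far as I know, oak-run's datastorecheck command can produce such dumps, but treat that as an assumption to verify against your oak-run version). The function names and the "id#length" line format handled below are illustrative assumptions, not a documented format.

```python
from collections import Counter


def normalize(line):
    """Strip whitespace and an optional '#length' suffix from a blob id line."""
    return line.strip().split("#", 1)[0]


def attribute_blobs(store_ids, references_per_instance):
    """Map every blob id found in the data store to the instances that reference it.

    store_ids: iterable of lines, one blob id per line (e.g. an open file).
    references_per_instance: dict of instance name -> iterable of id lines.
    """
    owners = {}
    for raw in store_ids:
        blob_id = normalize(raw)
        if blob_id:
            owners[blob_id] = set()
    for instance, refs in references_per_instance.items():
        for raw in refs:
            blob_id = normalize(raw)
            # References to ids missing from the store are ignored here;
            # they would indicate a consistency problem, not wasted space.
            if blob_id in owners:
                owners[blob_id].add(instance)
    return owners


def summarize(owners):
    """Count blobs per combination of referencing instances."""
    return Counter(
        ",".join(sorted(instances)) or "(unreferenced)"
        for instances in owners.values()
    )
```

The "(unreferenced)" bucket would hold files such as a hand-made dog.txt. Joining the result with file sizes from the FDS directory would show which bucket accounts for the extra space.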

Employee Advisor

How do you know that your author has only 320GB? Did you use the DiskUsage report to determine this?

 

In any case: can you relate this growth of the Datastore to any event? Is it increasing every day (no matter what you are doing) by 1GB? Is it increasing when you deploy? 

 

Level 5

Yes, I've used the Disk Usage report.

That's the tough part, relating this to an event - Going from 400G to 600G.  But oddly enough we are growing about 2G every hour - but at least it is recoverable through FDS GC.  We are experiencing, since our 6.5.12 upgrade, problems with our indexes and we've finally got them sorted out on our testing instance - but not production yet.  The async lane is failing with the oak:lucene and damAsset indexes.