Can you identify valid blob ids in File Data Store and remove unused files?
Level 4
June 23, 2022


Marking each AEM instance for file datastore GC is one thing.  But I'm wondering if there is a way, maybe with oak-run that I don't know about, that can do the reverse: get all the blob ids in the File Data Store and check against each AEM instance and mark any unused files.

For example, if I manually created a file dog.txt directly in the FDS, is there a command I could run that would flag it as never used by any of my AEM instances?

I have an externally shared FDS, using binary-less replication, with 1 author and 3 publish instances. I feel like I'm about 150G above what I should be.

 


SantoshSai
Community Advisor
June 23, 2022

Hi @sdouglasmcsonova ,

I have created a script that cleans up unused references, shown below. Customize it as per your requirements.

#!/bin/bash
# Run the console commands listed in cleanup.commands against the segment store
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar console crx-quickstart/repository/segmentstore < cleanup.commands
# List the checkpoints, then remove the unreferenced ones
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar checkpoints crx-quickstart/repository/segmentstore
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar checkpoints crx-quickstart/repository/segmentstore rm-unreferenced
# Remove all remaining checkpoints
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar checkpoints crx-quickstart/repository/segmentstore rm-all
# Run offline compaction, then delete the leftover .tar.bak files
java -Xmx1g -Doak.compaction.eagerFlush=true -Doffline-compaction=true -jar tools/oak-run-1.40.0.jar compact crx-quickstart/repository/segmentstore
rm crx-quickstart/repository/segmentstore/*.tar.bak
echo "Finished"

rmNode.groovy

import org.apache.jackrabbit.oak.spi.commit.CommitInfo
import org.apache.jackrabbit.oak.spi.commit.EmptyHook
import org.apache.jackrabbit.oak.spi.state.NodeStateUtils
import org.apache.jackrabbit.oak.spi.state.NodeStore
import org.apache.jackrabbit.oak.commons.PathUtils

// Removes the node at path, or only its children when includingThis is false.
def rmNode(def session, String path, boolean includingThis = true) {
    if (!includingThis) {
        println "Removing subnodes of ${path}"

        def ns = NodeStateUtils.getNode(session.getRoot(), path)
        for (def subNodeName : ns.getChildNodeNames()) {
            // Keep the access control policy node of the parent
            if (!subNodeName.equals("rep:policy")) {
                String subpath = path + "/" + subNodeName
                rmNode(session, subpath)
            }
        }
    } else {
        println "Removing node ${path}"

        NodeStore ns = session.store
        def nb = ns.root.builder()

        // Walk down from the root builder to the builder for path
        def aBuilder = nb
        for (p in PathUtils.elements(path)) {
            aBuilder = aBuilder.getChildNode(p)
        }
        if (aBuilder.exists()) {
            def rm = aBuilder.remove()
            ns.merge(nb, EmptyHook.INSTANCE, CommitInfo.EMPTY)
            return rm
        } else {
            println "Node ${path} doesn't exist"
            return false
        }
    }
}

Create a tools folder alongside your script and put the rmNode.groovy above and the oak-run jar in it.

Hope that helps!

Regards,

Santosh

Santosh Sai
Level 4
June 23, 2022

Appreciated, but that isn't going to be what I'm looking for the way we have things set up. 

joerghoh
Adobe Employee
June 25, 2022

Why does the datastore GC process not work for you? I am not aware of any other way to do it.

Level 4
June 26, 2022

It does work.  That's not the issue I'm having.  I notice that we have spiked about 150G somewhere in the past 2 months and I'm trying to figure out why.  So, I'd like to figure out a way I can validate all the blobids in the FDS against the repositories and not the other way around.  Do you know if the marksweep uses any indexes or just the blobid cache on the filesystem?
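
A rough sketch of the reverse check described here, not a supported procedure: list every file in the FDS, take the union of the blob ids each instance references, and print whatever is left over. The function name fds_orphans and the per-instance reference files are hypothetical; how those reference lists get exported from each repository is left open. It assumes the file names in the FDS are the blob ids and that a referenced id may carry a "#length" suffix.

```shell
#!/bin/bash
# Sketch: print FDS file names that none of the given reference lists mention.
# Assumptions (not from the thread): FDS file names are the blob ids, and each
# reference list holds one id per line, optionally followed by "#<length>".

# fds_orphans FDS_DIR REF_FILE [REF_FILE...]
fds_orphans() {
  local fds_dir=$1; shift
  [ "$#" -ge 1 ] || { echo "usage: fds_orphans FDS_DIR REF_FILE..." >&2; return 2; }
  local all refs
  all=$(mktemp); refs=$(mktemp)

  # Every regular file under the FDS, reduced to its bare file name.
  find "$fds_dir" -type f | sed 's|.*/||' | LC_ALL=C sort -u > "$all"

  # Union of ids referenced by any instance, with the "#length" suffix dropped.
  cat "$@" | sed 's/#.*$//' | grep -v '^$' | LC_ALL=C sort -u > "$refs"

  # comm -23 keeps lines that occur only in the first sorted input.
  LC_ALL=C comm -23 "$all" "$refs"
  rm -f "$all" "$refs"
}
```

Called as, say, fds_orphans /path/to/datastore author.refs publish1.refs publish2.refs publish3.refs (all paths hypothetical), a hand-made file such as dog.txt would show up in the output because no instance references it. Treat the result as a list of candidates to investigate, not as a delete list.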

joerghoh
Adobe Employee
July 13, 2022

It does clean up binaries. It works just fine. We do nightly revision cleanup on all 4 (1 author, 3 publish).

It just seems really odd we are sitting at about 620G shared when our author instance is around 320G itself.  I'm just looking to find a way of relating the blob ids back to their related AEM instance.  Obviously, I would expect 95% of the blob ids in the publish instances to be shared among all the instances.
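
One rough way to test the "95% shared" expectation, sketched under assumptions: given one text file of referenced blob ids per instance (a hypothetical export, one id per line, optionally with a "#length" suffix), count how many ids are referenced by exactly one instance, by two, and so on. The function name ref_sharing is made up for illustration.

```shell
#!/bin/bash
# Sketch: for each n, print how many blob ids are referenced by exactly n lists.
# Assumption (not from the thread): one id per line, optional "#<length>" suffix.

# ref_sharing REF_FILE [REF_FILE...]  ->  lines of "<n> <number of ids>"
ref_sharing() {
  local f
  for f in "$@"; do
    # De-duplicate inside each list so an instance counts once per id.
    sed 's/#.*$//' "$f" | grep -v '^$' | LC_ALL=C sort -u
  done |
    LC_ALL=C sort | uniq -c |
    awk '{ n[$1]++ } END { for (k in n) printf "%s %d\n", k, n[k] }' |
    sort -n
}
```

With 1 author and 3 publish lists, a large count on the "1" line would point at binaries that only a single instance references, which is where unexpected growth would be worth a closer look.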


How do you know that your author has only 320GB? Did you use the DiskUsage report to determine this?

 

In any case: can you relate this growth of the Datastore to any event? Is it increasing every day (no matter what you are doing) by 1GB? Is it increasing when you deploy?
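
To get a first impression of when the growth happened, a minimal sketch, assuming GNU find and direct filesystem access to the FDS: sum the file sizes per modification day. The function name fds_growth_by_day is hypothetical. Modification times are only an approximation of when a binary was added, since a file's timestamp can be refreshed later.

```shell
#!/bin/bash
# Sketch: total bytes of FDS files grouped by modification day.
# Requires GNU find (-printf). Assumption: mtime roughly reflects when a binary arrived.

# fds_growth_by_day FDS_DIR  ->  lines of "YYYY-MM-DD <bytes>"
fds_growth_by_day() {
  # %TY-%Tm-%Td is the modification date, %s the size in bytes.
  find "$1" -type f -printf '%TY-%Tm-%Td %s\n' |
    awk '{ bytes[$1] += $2 } END { for (d in bytes) printf "%s %.0f\n", d, bytes[d] }' |
    LC_ALL=C sort
}
```

Days with unusually large totals can then be compared against deployments, bulk uploads, or other events on those dates.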