Can you identify valid blob ids in File Data Store and remove unused files?

June 23, 2022
2 replies
2566 views

Marking each AEM instance for file datastore GC is one thing. But I'm wondering if there is a way, maybe with oak-run that I don't know about, that can do the reverse: get all the blob ids in the File Data Store and check against each AEM instance and mark any unused files.

For example, if I manually created a file dog.txt directly in the FDS, is there a command I can run that would flag it as never used by any of my AEM instances?

I have an externally shared FDS using binaryless replication, with 1 author and 3 publish instances. I feel like I'm about 150G above what I should be.

SantoshSai

Community Advisor

Hi @sdouglasmcsonova ,

I created the script below to clean up unused references; customize it to your requirements.

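#!/bin/bash
# Run the console commands from cleanup.commands (not shown in this post) against the segment store
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar console crx-quickstart/repository/segmentstore < cleanup.commands
# List the existing checkpoints
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar checkpoints crx-quickstart/repository/segmentstore
# Remove unreferenced checkpoints
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar checkpoints crx-quickstart/repository/segmentstore rm-unreferenced
# Remove all remaining checkpoints (async indexes typically reindex from scratch afterwards)
java -Xmx1g -Doak.compaction.eagerFlush=true -jar tools/oak-run-1.40.0.jar checkpoints crx-quickstart/repository/segmentstore rm-all
# Offline compaction of the segment store
java -Xmx1g -Doak.compaction.eagerFlush=true -Doffline-compaction=true -jar tools/oak-run-1.40.0.jar compact crx-quickstart/repository/segmentstore
# Delete the backup tar files left behind by compaction (-f: no error if none exist)
rm -f crx-quickstart/repository/segmentstore/*.tar.bak
echo "Finished"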

rmNode.groovy

import org.apache.jackrabbit.oak.spi.commit.CommitInfo
import org.apache.jackrabbit.oak.spi.commit.EmptyHook
import org.apache.jackrabbit.oak.spi.state.NodeStateUtils
import org.apache.jackrabbit.oak.spi.state.NodeStore
import org.apache.jackrabbit.oak.commons.PathUtils

// Removes the node at 'path'. With includingThis = false, only its children
// are removed (rep:policy is kept) and the node itself stays in place.
def rmNode(def session, String path, boolean includingThis = true) {
    if(!includingThis) {
        println "Removing subnodes of ${path}"

        def ns = NodeStateUtils.getNode(session.getRoot(), path);
        for(def subNodeName : ns.getChildNodeNames()) {
            if(!subNodeName.equals("rep:policy")) {
                String subpath = path + "/" + subNodeName;
                rmNode(session, subpath);
            }
        }
    } else {
        println "Removing node ${path}"

        NodeStore ns = session.store
        def nb = ns.root.builder()

        // Walk down from the root builder to the builder for 'path'
        def aBuilder = nb
        for(p in PathUtils.elements(path)) {
            aBuilder = aBuilder.getChildNode(p)
        }
        if(aBuilder.exists()) {
            // Remove the node and commit the change to the node store
            def rm = aBuilder.remove()
            ns.merge(nb, EmptyHook.INSTANCE, CommitInfo.EMPTY)
            return rm
        } else {
            println "Node ${path} doesn't exist"
            return false;
        }
    }
}

Create a tools folder next to your script and put the rmNode.groovy above and the oak-run jar (from here) in it.

Hope that helps!

Regards,

Santosh

Santosh Sai

sdouglasmcSonovaAuthor

Level 4

Appreciated, but that isn't going to be what I'm looking for the way we have things set up.

joerghoh

Adobe Employee

Why does the datastore GC process not work for you? I am not aware of any other way to do it.

sdouglasmcSonovaAuthor

Level 4

It does work. That's not the issue I'm having. I notice that we have spiked about 150G somewhere in the past 2 months and I'm trying to figure out why. So, I'd like to figure out a way I can validate all the blobids in the FDS against the repositories and not the other way around. Do you know if the marksweep uses any indexes or just the blobid cache on the filesystem?
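
A minimal sketch of that reverse check, under assumptions that are not from this thread: each instance has already produced a text file of the blob ids it references (one id per line, optionally suffixed with "#<length>"), and the files in the FDS directory are named after the bare blob id. How those reference files get produced is outside this sketch, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: list datastore files that no reference file mentions.
# Assumed (hypothetical) inputs: an FDS directory and one reference file per instance.

# Print the file name (assumed to be the blob id) of every file under the FDS directory.
list_fds_ids() {
    find "$1" -type f | sed 's|.*/||' | LC_ALL=C sort -u
}

# Print the union of all ids in the given reference files,
# dropping any "#<length>" suffix and blank lines.
list_referenced_ids() {
    cat "$@" | sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' | LC_ALL=C sort -u
}

# Print the ids that exist in the FDS but appear in none of the reference files.
# Usage: find_unreferenced <fds_dir> <refs_file>...
find_unreferenced() (
    fds_dir="$1"; shift
    tmp=$(mktemp -d) || exit 1
    list_fds_ids "$fds_dir" > "$tmp/all"
    list_referenced_ids "$@" > "$tmp/used"
    # comm -23 keeps the lines that occur only in the first (sorted) file
    LC_ALL=C comm -23 "$tmp/all" "$tmp/used"
    rm -rf "$tmp"
)
```

Called as find_unreferenced /path/to/datastore author.refs publish1.refs publish2.refs publish3.refs (all paths hypothetical), this would list a stray file such as dog.txt, since no instance references it. Any other non-blob file kept in the datastore directory would be listed as well, so the output is a list of candidates to review, not a delete list.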

joerghoh

Adobe Employee

It does clean up binaries. It works just fine. We do nightly revision cleanup on all 4 (1 author, 3 publish).

It just seems really odd we are sitting at about 620G shared when our author instance is around 320G itself. I'm just looking to find a way of relating the blob ids back to their related AEM instance. Obviously, I would expect 95% of the blob ids in the publish instances to be shared among all the instances.
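
One way to test that sharing expectation could be sketched as follows, again assuming (this is not from the thread) one reference file per instance with one blob id per line, optionally suffixed with "#<length>". The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: for each k, count how many blob ids are referenced by exactly k instances.
# Usage: share_histogram <refs_file>...
# Output: one line per k, in the form "<k> <number of ids>".
share_histogram() {
    for f in "$@"; do
        # De-duplicate within each file so an id counts once per instance
        sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$f" | LC_ALL=C sort -u
    done | LC_ALL=C sort | uniq -c |
        awk '{ n[$1]++ } END { for (k in n) print k, n[k] }' | sort -n
}
```

With four reference files, a large count on the "4" line would match the expectation that most binaries are shared by every instance, while a large count on the "1" line would point at binaries held by a single instance.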

How do you know that your author has only 320GB? Did you use the DiskUsage report to determine this?

In any case: can you relate this growth of the Datastore to any event? Is it increasing every day (no matter what you are doing) by 1GB? Is it increasing when you deploy?
