Adobe Experience Manager Sites & More

konstantyn_diachenko · 8/28/24

Hi everyone,

I customized the OOTB index /oak:index/damAssetLucene-11 and added stopwords support to it.

<damAssetLucene-11-custom-2
        jcr:primaryType="oak:QueryIndexDefinition"
        async="[async,nrt]"
        compatVersion="{Long}2"
        evaluatePathRestrictions="{Boolean}true"
        excludedPaths="[/some/path]"
        includedPaths="[/content/dam]"
        maxFieldLength="{Long}100000"
        tags="[visualSimilaritySearch,assetsOmnisearch]"
        type="lucene">
        <aggregates jcr:primaryType="nt:unstructured">
            ...
        </aggregates>
        <analyzers jcr:primaryType="nt:unstructured">
            <default
                jcr:primaryType="nt:unstructured"
                luceneMatchVersion="LUCENE_47"
                class="org.apache.lucene.analysis.standard.StandardAnalyzer">
                <stopwords jcr:primaryType="nt:file">
                    <jcr:content jcr:primaryType="nt:unstructured" jcr:mimeType="text/plain"/>
                </stopwords>
            </default>
        </analyzers>
        <indexRules jcr:primaryType="nt:unstructured">
            ...
        </indexRules>
        <suggestion
            jcr:primaryType="nt:unstructured"
            suggestAnalyzed="{Boolean}true"
            suggestUpdateFrequencyMinutes="{Long}5"/>
        <tika jcr:primaryType="nt:unstructured">
            <config.xml jcr:primaryType="nt:file">
                <jcr:content jcr:primaryType="nt:unstructured"/>
            </config.xml>
        </tika>
</damAssetLucene-11-custom-2>

| |__stopwords
|__tika
  |__config.xml

stopwords.dir/.content.xml

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
    jcr:primaryType="nt:file">
    <jcr:content
        jcr:encoding="utf-8"
        jcr:mimeType="text/plain"
        jcr:primaryType="nt:resource"/>
</jcr:root>

tika/config.xml is uploaded to the AEMaaCS instance without any problem, but the stopwords file isn't.

In the Cloud Manager Pipeline logs I see the following error:

23:31:55.820 [main] INFO  o.a.j.o.p.i.i.IndexDefinitionUpdater - Adding new index definition at path [/oak:index/damAssetLucene-11-custom-2]
23:31:55.853 [main] INFO  o.a.j.oak.index.IndexerSupport - Switched the async lane for indexes at [/oak:index/cqPageLucene-0-custom-2, /oak:index/damAssetLucene-11-custom-2] to offline-reindex-async and marked them for reindex
23:31:55.879 [main] INFO  o.a.j.oak.index.LuceneIndexHelper - Setting RAMBufferSize for LuceneIndexWriter (configurable via system property 'oak.index.ramBufferSizeMB') to 32 MB
23:31:55.917 [main] INFO  o.a.j.o.p.i.s.s.e.FulltextIndexEditorContext - Stored the cloned index definition for [/oak:index/cqPageLucene-0-custom-2]. Changes in index definition would now only be effective post reindexing
23:31:55.917 [main] INFO  o.a.j.o.p.i.s.s.e.FulltextIndexEditorContext - IndexDefinition creation timestamp added for [/oak:index/cqPageLucene-0-custom-2]
23:31:56.078 [main] ERROR c.adobe.granite.indexing.tool.Main - Can't perform operation
java.lang.NullPointerException: Cannot invoke "org.apache.jackrabbit.oak.api.Blob.getNewStream()" because "blob" is null
	at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.loadStopwordSet(NodeStateAnalyzerFactory.java:247)
	at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.createAnalyzerViaReflection(NodeStateAnalyzerFactory.java:161)
	at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.createInstance(NodeStateAnalyzerFactory.java:97)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition.collectAnalyzers(LuceneIndexDefinition.java:167)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition.<init>(LuceneIndexDefinition.java:76)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.createInstance(LuceneIndexDefinition.java:102)
	at org.apache.jackrabbit.oak.plugins.index.search.IndexDefinition$Builder.build(IndexDefinition.java:410)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.build(LuceneIndexDefinition.java:91)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.build(LuceneIndexDefinition.java:88)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.createIndexDefinition(FulltextIndexEditorContext.java:310)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.<init>(FulltextIndexEditorContext.java:107)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.<init>(LuceneIndexEditorContext.java:48)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorProvider.getIndexEditor(LuceneIndexEditorProvider.java:236)
	at org.apache.jackrabbit.oak.plugins.index.CompositeIndexEditorProvider.getIndexEditor(CompositeIndexEditorProvider.java:73)
	at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.collectIndexEditors(IndexUpdate.java:322)
	at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.enter(IndexUpdate.java:178)
	at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.enter(VisibleEditor.java:53)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:48)
	at org.apache.jackrabbit.oak.index.OutOfBandIndexerBase.preformIndexUpdate(OutOfBandIndexerBase.java:126)
	at org.apache.jackrabbit.oak.index.OutOfBandIndexerBase.reindex(OutOfBandIndexerBase.java:77)
	at com.adobe.granite.indexing.tool.ReindexCmd.index(ReindexCmd.java:244)
	at com.adobe.granite.indexing.tool.ReindexCmd.run(ReindexCmd.java:141)
	at com.adobe.granite.indexing.tool.Main.execute(Main.java:174)
	at com.adobe.granite.indexing.tool.Main.main(Main.java:77)

This happens because the stored definition of this index doesn't contain a jcr:data property for the stopwords node, whereas for tika/config.xml the property is present.

{
   "analyzers":{
      "jcr:primaryType":"nam:nt:unstructured",
      "default":{
         "jcr:primaryType":"nam:nt:unstructured",
         "luceneMatchVersion":"LUCENE_47",
         "class":"org.apache.lucene.analysis.standard.StandardAnalyzer",
         "stopwords":{
            "jcr:primaryType":"nam:nt:file",
            "jcr:content":{
               "jcr:encoding":"utf-8",
               "jcr:mimeType":"text/plain",
               "jcr:primaryType":"nam:nt:resource"
            }
         }
      }
   }
}
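The check below is a minimal sketch, not part of the original post. It walks a JSON export of an index definition like the one above and reports every nt:file node whose jcr:content has no jcr:data property, which is the condition that leaves the blob null in the stack trace. The function name find_files_without_data is hypothetical, and the handling of the "nam:" prefix is an assumption based only on the export shown here.

```python
import json


def find_files_without_data(node, path=""):
    """Return paths of nt:file nodes whose jcr:content lacks jcr:data."""
    missing = []
    if not isinstance(node, dict):
        return missing
    # The export above prefixes type names with "nam:"; accept both forms (assumption).
    if node.get("jcr:primaryType") in ("nt:file", "nam:nt:file"):
        content = node.get("jcr:content")
        if not isinstance(content, dict) or "jcr:data" not in content:
            missing.append(path or "/")
    # Recurse into child nodes; plain properties (strings, numbers) are skipped.
    for name, child in node.items():
        if isinstance(child, dict):
            missing.extend(find_files_without_data(child, f"{path}/{name}"))
    return missing


# Same structure as the stored definition quoted in the post.
definition = json.loads("""
{
  "analyzers": {
    "jcr:primaryType": "nam:nt:unstructured",
    "default": {
      "jcr:primaryType": "nam:nt:unstructured",
      "luceneMatchVersion": "LUCENE_47",
      "class": "org.apache.lucene.analysis.standard.StandardAnalyzer",
      "stopwords": {
        "jcr:primaryType": "nam:nt:file",
        "jcr:content": {
          "jcr:encoding": "utf-8",
          "jcr:mimeType": "text/plain",
          "jcr:primaryType": "nam:nt:resource"
        }
      }
    }
  }
}
""")

print(find_files_without_data(definition))
```

For the definition above this reports /analyzers/default/stopwords; a definition whose file nodes all carry jcr:data yields an empty list.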

Could you please help me fix the FileVault representation of the index nodes so that it is compatible with AEMaaCS pipelines? Locally it works fine.

Himanshu_Jain · 8/28/24

Hi @konstantyn_diachenko ,

Kindly refer to the following for the stopwords implementation:

https://experienceleaguecommunities.adobe.com/t5/adobe-experience-manager/custom-oak-indexes-aem-par...

Thanks

Himanshu Jain

konstantyn_diachenko · 8/29/24

Hi @Himanshu_Jain, thank you for your answer. I came to this solution as well and it fixed my problem.

The result:

<damAssetLucene-11-custom-2
        jcr:primaryType="oak:QueryIndexDefinition"
        async="[async,nrt]"
        compatVersion="{Long}2"
        evaluatePathRestrictions="{Boolean}true"
        excludedPaths="[/some/path]"
        includedPaths="[/content/dam]"
        maxFieldLength="{Long}100000"
        tags="[visualSimilaritySearch,assetsOmnisearch]"
        type="lucene">
        <aggregates jcr:primaryType="nt:unstructured">
            ...
        </aggregates>
        <analyzers jcr:primaryType="nt:unstructured">
            <default jcr:primaryType="nt:unstructured">
                <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
                <filters jcr:primaryType="nt:unstructured">
                    <LowerCase jcr:primaryType="nt:unstructured"/>
                    <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
                        <stopwords.txt jcr:primaryType="nt:file">
                            <jcr:content jcr:primaryType="nt:unstructured"/>
                        </stopwords.txt>
                    </Stop>
                </filters>
            </default>
        </analyzers>
        <indexRules jcr:primaryType="nt:unstructured">
            ...
        </indexRules>
        <suggestion
            jcr:primaryType="nt:unstructured"
            suggestAnalyzed="{Boolean}true"
            suggestUpdateFrequencyMinutes="{Long}5"/>
        <tika jcr:primaryType="nt:unstructured">
            <config.xml jcr:primaryType="nt:file">
                <jcr:content jcr:primaryType="nt:unstructured"/>
            </config.xml>
        </tika>
</damAssetLucene-11-custom-2>

The file structure:

| |__stopwords.txt
|__tika
  |__config.xml
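
Before pushing to the pipeline, a package can be checked locally for nt:file nodes that have no matching file on disk. The following is a minimal sketch under stated assumptions, not an Adobe or FileVault tool: the function name missing_binaries is hypothetical, the XML fixture is a simplified stand-in for the real .content.xml, and it assumes each binary sits at a path that mirrors the node hierarchy relative to the folder holding that .content.xml. It does not handle FileVault name escaping or .dir folders.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

JCR_NS = "http://www.jcp.org/jcr/1.0"
PRIMARY_TYPE = f"{{{JCR_NS}}}primaryType"

# Simplified fixture modelled on the working definition in this thread.
CONTENT_XML = """<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0"
    xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
    xmlns:oak="http://jackrabbit.apache.org/oak/ns/1.0">
  <damAssetLucene-11-custom-2 jcr:primaryType="oak:QueryIndexDefinition">
    <analyzers jcr:primaryType="nt:unstructured">
      <default jcr:primaryType="nt:unstructured">
        <filters jcr:primaryType="nt:unstructured">
          <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
            <stopwords.txt jcr:primaryType="nt:file">
              <jcr:content jcr:primaryType="nt:unstructured"/>
            </stopwords.txt>
          </Stop>
        </filters>
      </default>
    </analyzers>
    <tika jcr:primaryType="nt:unstructured">
      <config.xml jcr:primaryType="nt:file">
        <jcr:content jcr:primaryType="nt:unstructured"/>
      </config.xml>
    </tika>
  </damAssetLucene-11-custom-2>
</jcr:root>"""


def missing_binaries(content_xml, base_dir):
    """List nt:file nodes declared in content_xml that have no file under base_dir."""
    root = ET.fromstring(content_xml)
    missing = []

    def walk(element, parts):
        for child in element:
            child_parts = parts + [child.tag]
            if child.attrib.get(PRIMARY_TYPE) == "nt:file":
                # Assumed layout: the binary mirrors the node path on disk.
                if not os.path.isfile(os.path.join(base_dir, *child_parts)):
                    missing.append("/".join(child_parts))
            else:
                walk(child, child_parts)

    walk(root, [])
    return missing


# Demo: only tika/config.xml exists on disk, so the stopwords file is reported.
with tempfile.TemporaryDirectory() as base_dir:
    tika_dir = os.path.join(base_dir, "damAssetLucene-11-custom-2", "tika")
    os.makedirs(tika_dir)
    with open(os.path.join(tika_dir, "config.xml"), "w") as handle:
        handle.write("<properties/>")
    result = missing_binaries(CONTENT_XML, base_dir)

print(result)
```

In the demo only tika/config.xml is created, so the stopwords.txt node under analyzers/default/filters/Stop is the one reported as missing.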

AEMaaCS upload author index step failing with stopwords file