SOLVED

AEMaaCS upload author index step failing with stopwords file

Hi everyone,

I customized the OOTB index /oak:index/damAssetLucene-11 and added stopwords support to it.

<damAssetLucene-11-custom-2
        jcr:primaryType="oak:QueryIndexDefinition"
        async="[async,nrt]"
        compatVersion="{Long}2"
        evaluatePathRestrictions="{Boolean}true"
        excludedPaths="[/some/path]"
        includedPaths="[/content/dam]"
        maxFieldLength="{Long}100000"
        tags="[visualSimilaritySearch,assetsOmnisearch]"
        type="lucene">
        <aggregates jcr:primaryType="nt:unstructured">
            ...
        </aggregates>
        <analyzers jcr:primaryType="nt:unstructured">
            <default
                jcr:primaryType="nt:unstructured"
                luceneMatchVersion="LUCENE_47"
                class="org.apache.lucene.analysis.standard.StandardAnalyzer">
                <stopwords jcr:primaryType="nt:file">
                    <jcr:content jcr:primaryType="nt:unstructured" jcr:mimeType="text/plain"/>
                </stopwords>
            </default>
        </analyzers>
        <indexRules jcr:primaryType="nt:unstructured">
            ...
        </indexRules>
        <suggestion
            jcr:primaryType="nt:unstructured"
            suggestAnalyzed="{Boolean}true"
            suggestUpdateFrequencyMinutes="{Long}5"/>
        <tika jcr:primaryType="nt:unstructured">
            <config.xml jcr:primaryType="nt:file">
                <jcr:content jcr:primaryType="nt:unstructured"/>
            </config.xml>
        </tika>
</damAssetLucene-11-custom-2>

I have the following structure in the project:
_oak_index
|__damAssetLucene-11-custom-2
     |__analyzers
     |    |__default
     |         |__stopwords.dir
     |         |    |__.content.xml
     |         |__stopwords
     |__tika
          |__config.xml

stopwords.dir/.content.xml

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
    jcr:primaryType="nt:file">
    <jcr:content
        jcr:encoding="utf-8"
        jcr:mimeType="text/plain"
        jcr:primaryType="nt:resource"/>
</jcr:root>

 

tika/config.xml is uploaded to the AEMaaCS instance without any problem, but stopwords isn't.

In the Cloud Manager Pipeline logs I see the following error:

23:31:55.820 [main] INFO  o.a.j.o.p.i.i.IndexDefinitionUpdater - Adding new index definition at path [/oak:index/damAssetLucene-11-custom-2]
23:31:55.853 [main] INFO  o.a.j.oak.index.IndexerSupport - Switched the async lane for indexes at [/oak:index/cqPageLucene-0-custom-2, /oak:index/damAssetLucene-11-custom-2] to offline-reindex-async and marked them for reindex
23:31:55.879 [main] INFO  o.a.j.oak.index.LuceneIndexHelper - Setting RAMBufferSize for LuceneIndexWriter (configurable via system property 'oak.index.ramBufferSizeMB') to 32 MB
23:31:55.917 [main] INFO  o.a.j.o.p.i.s.s.e.FulltextIndexEditorContext - Stored the cloned index definition for [/oak:index/cqPageLucene-0-custom-2]. Changes in index definition would now only be effective post reindexing
23:31:55.917 [main] INFO  o.a.j.o.p.i.s.s.e.FulltextIndexEditorContext - IndexDefinition creation timestamp added for [/oak:index/cqPageLucene-0-custom-2]
23:31:56.078 [main] ERROR c.adobe.granite.indexing.tool.Main - Can't perform operation
java.lang.NullPointerException: Cannot invoke "org.apache.jackrabbit.oak.api.Blob.getNewStream()" because "blob" is null
	at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.loadStopwordSet(NodeStateAnalyzerFactory.java:247)
	at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.createAnalyzerViaReflection(NodeStateAnalyzerFactory.java:161)
	at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.createInstance(NodeStateAnalyzerFactory.java:97)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition.collectAnalyzers(LuceneIndexDefinition.java:167)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition.<init>(LuceneIndexDefinition.java:76)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.createInstance(LuceneIndexDefinition.java:102)
	at org.apache.jackrabbit.oak.plugins.index.search.IndexDefinition$Builder.build(IndexDefinition.java:410)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.build(LuceneIndexDefinition.java:91)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.build(LuceneIndexDefinition.java:88)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.createIndexDefinition(FulltextIndexEditorContext.java:310)
	at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.<init>(FulltextIndexEditorContext.java:107)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.<init>(LuceneIndexEditorContext.java:48)
	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorProvider.getIndexEditor(LuceneIndexEditorProvider.java:236)
	at org.apache.jackrabbit.oak.plugins.index.CompositeIndexEditorProvider.getIndexEditor(CompositeIndexEditorProvider.java:73)
	at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.collectIndexEditors(IndexUpdate.java:322)
	at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.enter(IndexUpdate.java:178)
	at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.enter(VisibleEditor.java:53)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:48)
	at org.apache.jackrabbit.oak.index.OutOfBandIndexerBase.preformIndexUpdate(OutOfBandIndexerBase.java:126)
	at org.apache.jackrabbit.oak.index.OutOfBandIndexerBase.reindex(OutOfBandIndexerBase.java:77)
	at com.adobe.granite.indexing.tool.ReindexCmd.index(ReindexCmd.java:244)
	at com.adobe.granite.indexing.tool.ReindexCmd.run(ReindexCmd.java:141)
	at com.adobe.granite.indexing.tool.Main.execute(Main.java:174)
	at com.adobe.granite.indexing.tool.Main.main(Main.java:77)

This happens because the index definition for this index doesn't contain a jcr:data property for stopwords, whereas for tika/config.xml it is present:

{
   "analyzers":{
      "jcr:primaryType":"nam:nt:unstructured",
      "default":{
         "jcr:primaryType":"nam:nt:unstructured",
         "luceneMatchVersion":"LUCENE_47",
         "class":"org.apache.lucene.analysis.standard.StandardAnalyzer",
         "stopwords":{
            "jcr:primaryType":"nam:nt:file",
            "jcr:content":{
               "jcr:encoding":"utf-8",
               "jcr:mimeType":"text/plain",
               "jcr:primaryType":"nam:nt:resource"
            }
         }
      }
   }
}

 

Could you please help me fix the FileVault representation of the index nodes so that it is compatible with AEMaaCS pipelines? Locally it works fine.

1 Accepted Solution

Correct answer by
Community Advisor

Hi @Himanshu_Jain, thank you for your answer. I arrived at the same solution, and it fixed my problem.

The result:

<damAssetLucene-11-custom-2
        jcr:primaryType="oak:QueryIndexDefinition"
        async="[async,nrt]"
        compatVersion="{Long}2"
        evaluatePathRestrictions="{Boolean}true"
        excludedPaths="[/some/path]"
        includedPaths="[/content/dam]"
        maxFieldLength="{Long}100000"
        tags="[visualSimilaritySearch,assetsOmnisearch]"
        type="lucene">
        <aggregates jcr:primaryType="nt:unstructured">
            ...
        </aggregates>
        <analyzers jcr:primaryType="nt:unstructured">
            <default jcr:primaryType="nt:unstructured">
                <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
                <filters jcr:primaryType="nt:unstructured">
                    <LowerCase jcr:primaryType="nt:unstructured"/>
                    <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
                        <stopwords.txt jcr:primaryType="nt:file">
                            <jcr:content jcr:primaryType="nt:unstructured"/>
                        </stopwords.txt>
                    </Stop>
                </filters>
            </default>
        </analyzers>
        <indexRules jcr:primaryType="nt:unstructured">
            ...
        </indexRules>
        <suggestion
            jcr:primaryType="nt:unstructured"
            suggestAnalyzed="{Boolean}true"
            suggestUpdateFrequencyMinutes="{Long}5"/>
        <tika jcr:primaryType="nt:unstructured">
            <config.xml jcr:primaryType="nt:file">
                <jcr:content jcr:primaryType="nt:unstructured"/>
            </config.xml>
        </tika>
</damAssetLucene-11-custom-2>

The file structure:

_oak_index
|__damAssetLucene-11-custom-2
     |__analyzers
     |    |__default
     |         |__filters
     |              |__Stop
     |                   |__stopwords.txt
     |__tika
          |__config.xml
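
For anyone wondering what goes into stopwords.txt itself, here is a minimal sketch. The words are purely illustrative and not taken from my project. It assumes the file is read in the default "wordset" format of Lucene's StopFilterFactory: one word per line, blank lines ignored, and lines starting with # treated as comments. Please verify this against your Oak/Lucene version before relying on it.

```text
# Illustrative stop words only (assumed wordset format: one word per line)
a
an
and
of
the
```

Since the LowerCase filter runs before the Stop filter in the chain above, lowercase entries should be sufficient.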