AEMaaCS upload author index step failing with stopwords file | Adobe Higher Education
konstantyn_diachenko
Community Advisor
August 28, 2024
Answered

AEMaaCS upload author index step failing with stopwords file

  • August 28, 2024
  • 1 reply
  • 558 views

Hi everyone,

I customized the OOTB index /oak:index/damAssetLucene-11 and added stopwords support to it.

<damAssetLucene-11-custom-2
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    excludedPaths="[/some/path]"
    includedPaths="[/content/dam]"
    maxFieldLength="{Long}100000"
    tags="[visualSimilaritySearch,assetsOmnisearch]"
    type="lucene">
    <aggregates jcr:primaryType="nt:unstructured"> ... </aggregates>
    <analyzers jcr:primaryType="nt:unstructured">
        <default
            jcr:primaryType="nt:unstructured"
            luceneMatchVersion="LUCENE_47"
            class="org.apache.lucene.analysis.standard.StandardAnalyzer">
            <stopwords jcr:primaryType="nt:file">
                <jcr:content jcr:primaryType="nt:unstructured" jcr:mimeType="text/plain"/>
            </stopwords>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured"> ... </indexRules>
    <suggestion
        jcr:primaryType="nt:unstructured"
        suggestAnalyzed="{Boolean}true"
        suggestUpdateFrequencyMinutes="{Long}5"/>
    <tika jcr:primaryType="nt:unstructured">
        <config.xml jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
        </config.xml>
    </tika>
</damAssetLucene-11-custom-2>

I have the following structure in the project:
_oak_index
|__damAssetLucene-11-custom-2
     |__analyzers
     |    |__default
     |         |__stopwords.dir
     |         |    |__.content.xml
     |         |__stopwords
     |__tika
          |__config.xml

stopwords.dir/.content.xml

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
    jcr:primaryType="nt:file">
    <jcr:content
        jcr:encoding="utf-8"
        jcr:mimeType="text/plain"
        jcr:primaryType="nt:resource"/>
</jcr:root>

tika/config.xml is uploaded to the AEMaaCS instance without any problem, but stopwords isn't.

In the Cloud Manager Pipeline logs I see the following error:

23:31:55.820 [main] INFO o.a.j.o.p.i.i.IndexDefinitionUpdater - Adding new index definition at path [/oak:index/damAssetLucene-11-custom-2]
23:31:55.853 [main] INFO o.a.j.oak.index.IndexerSupport - Switched the async lane for indexes at [/oak:index/cqPageLucene-0-custom-2, /oak:index/damAssetLucene-11-custom-2] to offline-reindex-async and marked them for reindex
23:31:55.879 [main] INFO o.a.j.oak.index.LuceneIndexHelper - Setting RAMBufferSize for LuceneIndexWriter (configurable via system property 'oak.index.ramBufferSizeMB') to 32 MB
23:31:55.917 [main] INFO o.a.j.o.p.i.s.s.e.FulltextIndexEditorContext - Stored the cloned index definition for [/oak:index/cqPageLucene-0-custom-2]. Changes in index definition would now only be effective post reindexing
23:31:55.917 [main] INFO o.a.j.o.p.i.s.s.e.FulltextIndexEditorContext - IndexDefinition creation timestamp added for [/oak:index/cqPageLucene-0-custom-2]
23:31:56.078 [main] ERROR c.adobe.granite.indexing.tool.Main - Can't perform operation
java.lang.NullPointerException: Cannot invoke "org.apache.jackrabbit.oak.api.Blob.getNewStream()" because "blob" is null
    at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.loadStopwordSet(NodeStateAnalyzerFactory.java:247)
    at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.createAnalyzerViaReflection(NodeStateAnalyzerFactory.java:161)
    at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.createInstance(NodeStateAnalyzerFactory.java:97)
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition.collectAnalyzers(LuceneIndexDefinition.java:167)
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition.<init>(LuceneIndexDefinition.java:76)
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.createInstance(LuceneIndexDefinition.java:102)
    at org.apache.jackrabbit.oak.plugins.index.search.IndexDefinition$Builder.build(IndexDefinition.java:410)
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.build(LuceneIndexDefinition.java:91)
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.build(LuceneIndexDefinition.java:88)
    at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.createIndexDefinition(FulltextIndexEditorContext.java:310)
    at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.<init>(FulltextIndexEditorContext.java:107)
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.<init>(LuceneIndexEditorContext.java:48)
    at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorProvider.getIndexEditor(LuceneIndexEditorProvider.java:236)
    at org.apache.jackrabbit.oak.plugins.index.CompositeIndexEditorProvider.getIndexEditor(CompositeIndexEditorProvider.java:73)
    at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.collectIndexEditors(IndexUpdate.java:322)
    at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.enter(IndexUpdate.java:178)
    at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.enter(VisibleEditor.java:53)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:48)
    at org.apache.jackrabbit.oak.index.OutOfBandIndexerBase.preformIndexUpdate(OutOfBandIndexerBase.java:126)
    at org.apache.jackrabbit.oak.index.OutOfBandIndexerBase.reindex(OutOfBandIndexerBase.java:77)
    at com.adobe.granite.indexing.tool.ReindexCmd.index(ReindexCmd.java:244)
    at com.adobe.granite.indexing.tool.ReindexCmd.run(ReindexCmd.java:141)
    at com.adobe.granite.indexing.tool.Main.execute(Main.java:174)
    at com.adobe.granite.indexing.tool.Main.main(Main.java:77)

This happens because the stopwords file node in the resulting index definition has no jcr:data property, whereas tika/config.xml does have it.

{
  "analyzers": {
    "jcr:primaryType": "nam:nt:unstructured",
    "default": {
      "jcr:primaryType": "nam:nt:unstructured",
      "luceneMatchVersion": "LUCENE_47",
      "class": "org.apache.lucene.analysis.standard.StandardAnalyzer",
      "stopwords": {
        "jcr:primaryType": "nam:nt:file",
        "jcr:content": {
          "jcr:encoding": "utf-8",
          "jcr:mimeType": "text/plain",
          "jcr:primaryType": "nam:nt:resource"
        }
      }
    }
  }
}
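A quick way to spot this kind of gap before a pipeline run is to walk a JSON dump of the index definition and list every nt:file node whose jcr:content has no jcr:data. The following is only a minimal sketch in Python: the function name files_missing_data is hypothetical, it assumes the dump has the same shape as the JSON above (primary types prefixed with "nam:"), and it only checks that a jcr:data key is present, not how the binary is encoded.

```python
import json


def files_missing_data(node, path=""):
    """Return the paths of nt:file nodes whose jcr:content has no jcr:data.

    Assumption: `node` is a dict parsed from a JSON dump shaped like the
    one in this post, where types look like "nam:nt:file".
    """
    missing = []
    if not isinstance(node, dict):
        return missing
    primary_type = str(node.get("jcr:primaryType", ""))
    if primary_type.endswith("nt:file"):
        content = node.get("jcr:content")
        if not isinstance(content, dict) or "jcr:data" not in content:
            missing.append(path or "/")
    # Recurse into child nodes (dict values); plain properties are skipped.
    for name, child in node.items():
        if isinstance(child, dict):
            missing.extend(files_missing_data(child, f"{path}/{name}"))
    return missing


# The analyzers fragment from the dump above.
definition = json.loads("""
{
  "analyzers": {
    "jcr:primaryType": "nam:nt:unstructured",
    "default": {
      "jcr:primaryType": "nam:nt:unstructured",
      "luceneMatchVersion": "LUCENE_47",
      "class": "org.apache.lucene.analysis.standard.StandardAnalyzer",
      "stopwords": {
        "jcr:primaryType": "nam:nt:file",
        "jcr:content": {
          "jcr:encoding": "utf-8",
          "jcr:mimeType": "text/plain",
          "jcr:primaryType": "nam:nt:resource"
        }
      }
    }
  }
}
""")

print(files_missing_data(definition))  # ['/analyzers/default/stopwords']
```

The reported path matches the node that NodeStateAnalyzerFactory.loadStopwordSet fails on in the stack trace, since it tries to open a stream on a blob that is not there.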

 

Could you please help me fix the FileVault representation of the index nodes so that it is compatible with AEMaaCS pipelines? Locally it works fine.

This topic has been closed for replies.

1 Reply

Himanshu_Jain
Community Advisor
August 29, 2024
konstantyn_diachenko
Community Advisor
August 29, 2024

Hi @himanshu_jain, thank you for your answer. I came to this solution as well and it fixed my problem.

The result:

<damAssetLucene-11-custom-2
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    excludedPaths="[/some/path]"
    includedPaths="[/content/dam]"
    maxFieldLength="{Long}100000"
    tags="[visualSimilaritySearch,assetsOmnisearch]"
    type="lucene">
    <aggregates jcr:primaryType="nt:unstructured"> ... </aggregates>
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
            <filters jcr:primaryType="nt:unstructured">
                <LowerCase jcr:primaryType="nt:unstructured"/>
                <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
                    <stopwords.txt jcr:primaryType="nt:file">
                        <jcr:content jcr:primaryType="nt:unstructured"/>
                    </stopwords.txt>
                </Stop>
            </filters>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured"> ... </indexRules>
    <suggestion
        jcr:primaryType="nt:unstructured"
        suggestAnalyzed="{Boolean}true"
        suggestUpdateFrequencyMinutes="{Long}5"/>
    <tika jcr:primaryType="nt:unstructured">
        <config.xml jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
        </config.xml>
    </tika>
</damAssetLucene-11-custom-2>

The file structure:

_oak_index
|__damAssetLucene-11-custom-2
     |__analyzers
     |    |__default
     |         |__filters
     |              |__Stop
     |                   |__stopwords.txt
     |__tika
          |__config.xml
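For readers unfamiliar with composed analyzers, the chain configured above (Standard tokenizer, then LowerCase, then Stop) can be pictured with a small Python model. This is only an illustrative sketch and not Lucene's or Oak's implementation: the regex is a rough stand-in for the Standard tokenizer, the names load_stopwords and analyze are hypothetical, and the one-word-per-line file format for stopwords.txt is an assumption.

```python
import re


def load_stopwords(text):
    """Parse stopword file content, assuming one word per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def analyze(text, stopwords):
    """Tokenize, lowercase, then drop stopwords, in that order."""
    tokens = re.findall(r"\w+", text)      # rough stand-in for the Standard tokenizer
    tokens = [t.lower() for t in tokens]   # LowerCase filter
    return [t for t in tokens if t not in stopwords]  # Stop filter


stopwords = load_stopwords("the\nof\nand\n")
print(analyze("The Annual Report of Sales and Marketing", stopwords))
# ['annual', 'report', 'sales', 'marketing']
```

Because lowercasing runs before the stop step in this model, lowercase entries in the word list also match capitalized input such as "The".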

Kostiantyn Diachenko, Community Advisor, Certified Senior AEM Developer, creator of free AEM VLT Tool, maintainer of AEM Tools plugin.