Hi everyone,
I customized the OOTB index /oak:index/damAssetLucene-11 and added stopword support to it:
<damAssetLucene-11-custom-2
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    excludedPaths="[/some/path]"
    includedPaths="[/content/dam]"
    maxFieldLength="{Long}100000"
    tags="[visualSimilaritySearch,assetsOmnisearch]"
    type="lucene">
    <aggregates jcr:primaryType="nt:unstructured">
        ...
    </aggregates>
    <analyzers jcr:primaryType="nt:unstructured">
        <default
            jcr:primaryType="nt:unstructured"
            luceneMatchVersion="LUCENE_47"
            class="org.apache.lucene.analysis.standard.StandardAnalyzer">
            <stopwords jcr:primaryType="nt:file">
                <jcr:content jcr:primaryType="nt:unstructured" jcr:mimeType="text/plain"/>
            </stopwords>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured">
        ...
    </indexRules>
    <suggestion
        jcr:primaryType="nt:unstructured"
        suggestAnalyzed="{Boolean}true"
        suggestUpdateFrequencyMinutes="{Long}5"/>
    <tika jcr:primaryType="nt:unstructured">
        <config.xml jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
        </config.xml>
    </tika>
</damAssetLucene-11-custom-2>
I have the following structure in the project:
_oak_index
|__damAssetLucene-11-custom-2
   |__analyzers
   |  |__default
   |     |__stopwords.dir
   |     |  |__.content.xml
   |     |__stopwords
   |__tika
      |__config.xml
stopwords.dir/.content.xml
<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
    jcr:primaryType="nt:file">
    <jcr:content
        jcr:encoding="utf-8"
        jcr:mimeType="text/plain"
        jcr:primaryType="nt:resource"/>
</jcr:root>
tika/config.xml is uploaded to the AEMaaCS instance without any problem, but stopwords isn't.
In the Cloud Manager Pipeline logs I see the following error:
23:31:55.820 [main] INFO o.a.j.o.p.i.i.IndexDefinitionUpdater - Adding new index definition at path [/oak:index/damAssetLucene-11-custom-2]
23:31:55.853 [main] INFO o.a.j.oak.index.IndexerSupport - Switched the async lane for indexes at [/oak:index/cqPageLucene-0-custom-2, /oak:index/damAssetLucene-11-custom-2] to offline-reindex-async and marked them for reindex
23:31:55.879 [main] INFO o.a.j.oak.index.LuceneIndexHelper - Setting RAMBufferSize for LuceneIndexWriter (configurable via system property 'oak.index.ramBufferSizeMB') to 32 MB
23:31:55.917 [main] INFO o.a.j.o.p.i.s.s.e.FulltextIndexEditorContext - Stored the cloned index definition for [/oak:index/cqPageLucene-0-custom-2]. Changes in index definition would now only be effective post reindexing
23:31:55.917 [main] INFO o.a.j.o.p.i.s.s.e.FulltextIndexEditorContext - IndexDefinition creation timestamp added for [/oak:index/cqPageLucene-0-custom-2]
23:31:56.078 [main] ERROR c.adobe.granite.indexing.tool.Main - Can't perform operation
java.lang.NullPointerException: Cannot invoke "org.apache.jackrabbit.oak.api.Blob.getNewStream()" because "blob" is null
at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.loadStopwordSet(NodeStateAnalyzerFactory.java:247)
at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.createAnalyzerViaReflection(NodeStateAnalyzerFactory.java:161)
at org.apache.jackrabbit.oak.plugins.index.lucene.NodeStateAnalyzerFactory.createInstance(NodeStateAnalyzerFactory.java:97)
at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition.collectAnalyzers(LuceneIndexDefinition.java:167)
at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition.<init>(LuceneIndexDefinition.java:76)
at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.createInstance(LuceneIndexDefinition.java:102)
at org.apache.jackrabbit.oak.plugins.index.search.IndexDefinition$Builder.build(IndexDefinition.java:410)
at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.build(LuceneIndexDefinition.java:91)
at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexDefinition$Builder.build(LuceneIndexDefinition.java:88)
at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.createIndexDefinition(FulltextIndexEditorContext.java:310)
at org.apache.jackrabbit.oak.plugins.index.search.spi.editor.FulltextIndexEditorContext.<init>(FulltextIndexEditorContext.java:107)
at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.<init>(LuceneIndexEditorContext.java:48)
at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorProvider.getIndexEditor(LuceneIndexEditorProvider.java:236)
at org.apache.jackrabbit.oak.plugins.index.CompositeIndexEditorProvider.getIndexEditor(CompositeIndexEditorProvider.java:73)
at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.collectIndexEditors(IndexUpdate.java:322)
at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.enter(IndexUpdate.java:178)
at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.enter(VisibleEditor.java:53)
at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:48)
at org.apache.jackrabbit.oak.index.OutOfBandIndexerBase.preformIndexUpdate(OutOfBandIndexerBase.java:126)
at org.apache.jackrabbit.oak.index.OutOfBandIndexerBase.reindex(OutOfBandIndexerBase.java:77)
at com.adobe.granite.indexing.tool.ReindexCmd.index(ReindexCmd.java:244)
at com.adobe.granite.indexing.tool.ReindexCmd.run(ReindexCmd.java:141)
at com.adobe.granite.indexing.tool.Main.execute(Main.java:174)
at com.adobe.granite.indexing.tool.Main.main(Main.java:77)
This is because the index definition for this index doesn't contain a jcr:data property for the stopwords file, whereas for tika/config.xml it is present:
{
  "analyzers":{
    "jcr:primaryType":"nam:nt:unstructured",
    "default":{
      "jcr:primaryType":"nam:nt:unstructured",
      "luceneMatchVersion":"LUCENE_47",
      "class":"org.apache.lucene.analysis.standard.StandardAnalyzer",
      "stopwords":{
        "jcr:primaryType":"nam:nt:file",
        "jcr:content":{
          "jcr:encoding":"utf-8",
          "jcr:mimeType":"text/plain",
          "jcr:primaryType":"nam:nt:resource"
        }
      }
    }
  }
}
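One way to spot this kind of gap before a pipeline run is to scan an exported index definition for nt:file nodes whose jcr:content has no jcr:data. Below is a minimal sketch in Python. The function name find_files_missing_data and the sample JSON (including the placeholder jcr:data value under tika/config.xml) are assumptions for illustration and are not part of any Oak or AEM tooling.

```python
import json


def find_files_missing_data(node, path=""):
    """Return paths of nt:file nodes whose jcr:content child has no jcr:data."""
    missing = []
    if not isinstance(node, dict):
        return missing
    primary_type = str(node.get("jcr:primaryType", ""))
    # The exported JSON prefixes name values with "nam:", e.g. "nam:nt:file".
    if primary_type.endswith("nt:file"):
        content = node.get("jcr:content")
        if not isinstance(content, dict) or "jcr:data" not in content:
            missing.append(path or "/")
    # Recurse into child nodes (dict values); plain properties are skipped.
    for name, child in node.items():
        if isinstance(child, dict):
            missing.extend(find_files_missing_data(child, f"{path}/{name}"))
    return missing


# Sample definition: the analyzers part mirrors the JSON above; the tika part
# and its jcr:data value are hypothetical placeholders.
definition = json.loads("""
{
  "analyzers": {
    "jcr:primaryType": "nam:nt:unstructured",
    "default": {
      "jcr:primaryType": "nam:nt:unstructured",
      "stopwords": {
        "jcr:primaryType": "nam:nt:file",
        "jcr:content": {
          "jcr:encoding": "utf-8",
          "jcr:mimeType": "text/plain",
          "jcr:primaryType": "nam:nt:resource"
        }
      }
    }
  },
  "tika": {
    "jcr:primaryType": "nam:nt:unstructured",
    "config.xml": {
      "jcr:primaryType": "nam:nt:file",
      "jcr:content": {
        "jcr:primaryType": "nam:nt:resource",
        "jcr:data": "placeholder"
      }
    }
  }
}
""")

print(find_files_missing_data(definition))  # ['/analyzers/default/stopwords']
```

Any path it reports is a file node that would hand the analyzer factory a null blob, which matches the NullPointerException in the log.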
Could you please help me fix the FileVault representation of the index nodes so that it is compatible with AEMaaCS pipelines? Locally it works fine.
Kindly refer for stopwords implementation
Thanks
Hi @Himanshu_Jain, thank you for your answer. I came to this solution as well and it fixed my problem.
The result:
<damAssetLucene-11-custom-2
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    excludedPaths="[/some/path]"
    includedPaths="[/content/dam]"
    maxFieldLength="{Long}100000"
    tags="[visualSimilaritySearch,assetsOmnisearch]"
    type="lucene">
    <aggregates jcr:primaryType="nt:unstructured">
        ...
    </aggregates>
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
            <filters jcr:primaryType="nt:unstructured">
                <LowerCase jcr:primaryType="nt:unstructured"/>
                <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
                    <stopwords.txt jcr:primaryType="nt:file">
                        <jcr:content jcr:primaryType="nt:unstructured"/>
                    </stopwords.txt>
                </Stop>
            </filters>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured">
        ...
    </indexRules>
    <suggestion
        jcr:primaryType="nt:unstructured"
        suggestAnalyzed="{Boolean}true"
        suggestUpdateFrequencyMinutes="{Long}5"/>
    <tika jcr:primaryType="nt:unstructured">
        <config.xml jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
        </config.xml>
    </tika>
</damAssetLucene-11-custom-2>
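For readers unfamiliar with the analyzer chain configured here (Standard tokenizer, then LowerCase, then Stop), the following Python sketch gives a rough idea of its effect on text. It is only an approximation: the regex split is a crude stand-in for Lucene's Standard tokenizer, and the stopword set is a hypothetical example, not the contents of the actual stopwords.txt.

```python
import re


def analyze(text, stopwords):
    """Approximate the chain: tokenize, lowercase, then drop stopwords."""
    tokens = re.findall(r"\w+", text)        # crude stand-in for the Standard tokenizer
    lowered = [t.lower() for t in tokens]    # LowerCase filter
    return [t for t in lowered if t not in stopwords]  # Stop filter


# Hypothetical stopword list; the real one lives in stopwords.txt.
stopwords = {"the", "of", "a"}

print(analyze("The Logo of a Brand", stopwords))  # ['logo', 'brand']
```

Because lowercasing runs before the stop filter, the entries in the stopword list are compared against lowercase tokens.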
The file structure:
_oak_index
|__damAssetLucene-11-custom-2
   |__analyzers
   |  |__default
   |     |__filters
   |        |__Stop
   |           |__stopwords.txt
   |__tika
      |__config.xml