Hi everyone,
I had to customize the OOTB index /oak:index/damAssetLucene-11 with stopwords, some new indexRules, and suggestions for AEMaaCS. I based my customization on /oak:index/damAssetLucene-11 and added tika/config.xml as well. My index definition is below:
<damAssetLucene-11-custom-2
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    excludedPaths="[/some/path]"
    includedPaths="[/content/dam]"
    maxFieldLength="{Long}100000"
    tags="[visualSimilaritySearch,assetsOmnisearch]"
    type="lucene">
    <aggregates jcr:primaryType="nt:unstructured">
        ...
    </aggregates>
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
            <filters jcr:primaryType="nt:unstructured">
                <LowerCase jcr:primaryType="nt:unstructured"/>
                <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
                    <stopwords.txt jcr:primaryType="nt:file">
                        <jcr:content jcr:primaryType="nt:unstructured"/>
                    </stopwords.txt>
                </Stop>
            </filters>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured">
        ...
    </indexRules>
    <suggestion
        jcr:primaryType="nt:unstructured"
        suggestAnalyzed="{Boolean}true"
        suggestUpdateFrequencyMinutes="{Long}5"/>
    <tika jcr:primaryType="nt:unstructured">
        <config.xml jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
        </config.xml>
    </tika>
</damAssetLucene-11-custom-2>
However, with this index I can't search by PDF content: queries return no results.
Locally, if I remove tika/config.xml, the index returns results. After deployment to AEMaaCS, the index doesn't return PDF documents in the results.
Query example: /jcr:root/content/dam/project/en/sitecontent/documents//element(*, dam:Asset)[(jcr:contains(., 'some text in the pdf*'))]/rep:excerpt(.)
By the way, after the deployment to AEMaaCS, the /oak:index/damAssetLucene-11-custom-1 and /oak:index/damAssetLucene-11 indexes are still enabled.
Do you have any ideas about the potential root cause?

Can you validate your index definition using the link below?
https://oakutils.appspot.com/analyze/index
Thanks
Can you also share the content of the config.xml file under the tika node?
Thanks
Hi @Himanshu_Jain , thank you for your answer.
I use the tika/config.xml from /oak:index/damAssetLucene-11 on the AEMaaCS dev instance, and it's equivalent to the one in this documentation (https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/operations/index...):
<properties>
    <detectors>
        <detector class="org.apache.tika.detect.TypeDetector"/>
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime>text/plain</mime>
        </parser>
    </parsers>
    <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>
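
A note on reading this config: in Tika's configuration format, `<mime>` children of a `<parser>` element restrict that parser to the listed types, so this file appears to limit in-index text extraction to text/plain and would not parse application/pdf binaries directly. That would be consistent with the local observation that removing tika/config.xml makes PDF content searchable. The Python sketch below just lists the MIME types a config names explicitly; the helper names `allowed_mime_types` and `is_listed` are hypothetical, and it only inspects the XML rather than reproducing Tika's own resolution logic.

```python
import xml.etree.ElementTree as ET

# The tika/config.xml content quoted in this thread.
TIKA_CONFIG = """<properties>
  <detectors>
    <detector class="org.apache.tika.detect.TypeDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime>text/plain</mime>
    </parser>
  </parsers>
  <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>"""


def allowed_mime_types(config_xml):
    """Map each parser class to the MIME types listed in its <mime> children."""
    root = ET.fromstring(config_xml)
    return {
        parser.get("class"): [m.text.strip() for m in parser.findall("mime")]
        for parser in root.iter("parser")
    }


def is_listed(config_xml, mime_type):
    """Return True if any parser in the config explicitly lists mime_type."""
    return any(
        mime_type in mimes for mimes in allowed_mime_types(config_xml).values()
    )


if __name__ == "__main__":
    print(allowed_mime_types(TIKA_CONFIG))
    print("application/pdf listed:", is_listed(TIKA_CONFIG, "application/pdf"))
```

Run against the config above, the check reports that only text/plain is listed, which suggests PDF text has to reach the index some other way (for example, through asset processing) rather than through this parser.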

I found the root cause. I had installed the assets via an AEM package created on an on-prem AEM instance. After reprocessing all assets, PDF content became searchable.

@konstantyn_diachenko Did you find the suggestion helpful? Please let us know if you require more information. Otherwise, please mark the answer as correct for posterity. If you've discovered a solution yourself, we would appreciate it if you could share it with the community. Thank you!