Hi everyone,
I had to customize the OOTB index /oak:index/damAssetLucene-11 for AEMaaCS with stopwords, some new indexRules, and suggestions. My customization is based on /oak:index/damAssetLucene-11, and I added tika/config.xml as well. You can find an example of my index below:
<damAssetLucene-11-custom-2
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    excludedPaths="[/some/path]"
    includedPaths="[/content/dam]"
    maxFieldLength="{Long}100000"
    tags="[visualSimilaritySearch,assetsOmnisearch]"
    type="lucene">
  <aggregates jcr:primaryType="nt:unstructured"> ... </aggregates>
  <analyzers jcr:primaryType="nt:unstructured">
    <default jcr:primaryType="nt:unstructured">
      <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
      <filters jcr:primaryType="nt:unstructured">
        <LowerCase jcr:primaryType="nt:unstructured"/>
        <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
          <stopwords.txt jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
          </stopwords.txt>
        </Stop>
      </filters>
    </default>
  </analyzers>
  <indexRules jcr:primaryType="nt:unstructured"> ... </indexRules>
  <suggestion
      jcr:primaryType="nt:unstructured"
      suggestAnalyzed="{Boolean}true"
      suggestUpdateFrequencyMinutes="{Long}5"/>
  <tika jcr:primaryType="nt:unstructured">
    <config.xml jcr:primaryType="nt:file">
      <jcr:content jcr:primaryType="nt:unstructured"/>
    </config.xml>
  </tika>
</damAssetLucene-11-custom-2>
However, with this index I can't search by PDF content: queries return no results.
Locally, if I remove tika/config.xml, the index returns results. After deployment to AEMaaCS, the index doesn't return PDF documents in the results.
Query example: /jcr:root/content/dam/project/en/sitecontent/documents//element(*, dam:Asset)[(jcr:contains(., 'some text in the pdf*'))]/rep:excerpt(.)
By the way, after the deployment to AEMaaCS, the /oak:index/damAssetLucene-11-custom-1 and /oak:index/damAssetLucene-11 indexes are still enabled.
Do you have any ideas about the potential root cause?
Can you validate your index definition using the link below?
https://oakutils.appspot.com/analyze/index
Thanks
Can you also share the content of the config.xml file under the tika node?
Thanks
Hi @Himanshu_Jain, thank you for your answer.
I use tika/config.xml from /oak:index/damAssetLucene-11 on the AEMaaCS dev instance and it's equivalent to the one in this documentation (https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/operations/index...
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.TypeDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime>text/plain</mime>
    </parser>
  </parsers>
  <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>
I found the root cause. I had installed the assets via an AEM package created on an on-prem AEM instance. After I reprocessed all assets, PDF content became searchable.
@konstantyn_diachenko Did you find the suggestion helpful? Please let us know if you require more information. Otherwise, please mark the answer as correct for posterity. If you've discovered a solution yourself, we would appreciate it if you could share it with the community. Thank you!