Hi everyone,
I had to customize the OOTB index /oak:index/damAssetLucene-11 for AEMaaCS with stopwords, some new indexRules, and suggestions. I based my customization on /oak:index/damAssetLucene-11 and also added the tika config.xml. My index definition is below:
<damAssetLucene-11-custom-2
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    excludedPaths="[/some/path]"
    includedPaths="[/content/dam]"
    maxFieldLength="{Long}100000"
    tags="[visualSimilaritySearch,assetsOmnisearch]"
    type="lucene">
    <aggregates jcr:primaryType="nt:unstructured"> ... </aggregates>
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
            <filters jcr:primaryType="nt:unstructured">
                <LowerCase jcr:primaryType="nt:unstructured"/>
                <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
                    <stopwords.txt jcr:primaryType="nt:file">
                        <jcr:content jcr:primaryType="nt:unstructured"/>
                    </stopwords.txt>
                </Stop>
            </filters>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured"> ... </indexRules>
    <suggestion
        jcr:primaryType="nt:unstructured"
        suggestAnalyzed="{Boolean}true"
        suggestUpdateFrequencyMinutes="{Long}5"/>
    <tika jcr:primaryType="nt:unstructured">
        <config.xml jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
        </config.xml>
    </tika>
</damAssetLucene-11-custom-2>
However, with this index I can't search by PDF content; queries return no results.
Locally, if I remove tika/config.xml, the index returns results. After deployment to AEMaaCS, the index doesn't return PDF documents in the results.
Query example: /jcr:root/content/dam/project/en/sitecontent/documents//element(*, dam:Asset)[(jcr:contains(., 'some text in the pdf*'))]/rep:excerpt(.)
By the way, after the deployment to AEMaaCS, the /oak:index/damAssetLucene-11-custom-1 and /oak:index/damAssetLucene-11 indexes are still enabled.
Do you have any ideas about the potential root cause?
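On the side question of several index versions being present at once, here is a minimal Python sketch that orders the three index names from this thread by version. It assumes the `<name>-<productVersion>-custom-<customVersion>` naming convention described in Adobe's AEMaaCS indexing documentation, under which the highest version is the one expected to serve queries once it is fully indexed. The helper `parse_index_name` is made up for illustration and is not part of any AEM or Oak API.

```python
import re

# Index names mentioned in this thread.
INDEX_NAMES = [
    "damAssetLucene-11",
    "damAssetLucene-11-custom-1",
    "damAssetLucene-11-custom-2",
]

# Assumed convention: <base>-<productVersion>[-custom-<customVersion>]
NAME_PATTERN = re.compile(
    r"^(?P<base>.+?)-(?P<product>\d+)(?:-custom-(?P<custom>\d+))?$"
)


def parse_index_name(name):
    """Split an index name into (base, product_version, custom_version).

    A name without a "-custom-N" suffix is treated as custom version 0.
    """
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected index name: {name}")
    custom = match.group("custom")
    return match.group("base"), int(match.group("product")), int(custom or 0)


# Order by (product_version, custom_version) and take the highest.
newest = max(INDEX_NAMES, key=lambda n: parse_index_name(n)[1:])
print("Highest version:", newest)
```

If this assumption holds, seeing the older definitions next to the new one would not by itself explain the missing PDF results.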
Can you validate your index definition using the link below?
Can you also share the contents of the config.xml file under the tika node?
Hi @Himanshu_Jain , thank you for your answer.
I use the tika/config.xml from /oak:index/damAssetLucene-11 on the AEMaaCS dev instance, and it is equivalent to the one in this documentation: https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/operations/index...
<properties>
    <detectors>
        <detector class="org.apache.tika.detect.TypeDetector"/>
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime>text/plain</mime>
        </parser>
    </parsers>
    <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>
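For anyone checking their own config, here is a minimal Python sketch that lists the MIME types each parser entry is limited to. It assumes the usual Tika config semantics, in which `<mime>` children restrict a parser to the listed types. The function name `parser_mime_restrictions` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tika config as posted in this thread.
TIKA_CONFIG = """<properties>
  <detectors>
    <detector class="org.apache.tika.detect.TypeDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime>text/plain</mime>
    </parser>
  </parsers>
  <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>"""


def parser_mime_restrictions(config_xml):
    """Map each configured parser class to the MIME types in its <mime> children."""
    root = ET.fromstring(config_xml)
    return {
        parser.get("class"): [
            (mime.text or "").strip() for mime in parser.findall("mime")
        ]
        for parser in root.findall("./parsers/parser")
    }


restrictions = parser_mime_restrictions(TIKA_CONFIG)
for parser_class, mimes in restrictions.items():
    print(parser_class, "->", mimes or "no <mime> restriction")

# Check whether any parser entry explicitly lists PDFs.
pdf_listed = any("application/pdf" in mimes for mimes in restrictions.values())
print("application/pdf listed:", pdf_listed)
```

With the config above, only text/plain is listed for the DefaultParser, so this file on its own does not name PDFs as a type to parse. What that means for PDF text extraction on AEMaaCS is not confirmed in this thread.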
I found the root cause. I had installed the assets via an AEM package created on an on-prem AEM instance. After I reprocessed all assets, PDF content became searchable.
@konstantyn_diachenko Did you find the suggestion helpful? Please let us know if you require more information. Otherwise, please mark the answer as correct for posterity. If you've discovered a solution yourself, we would appreciate it if you could share it with the community. Thank you!