AEMaaCS customized damAssetLucene-11 index doesn't return results from PDFs

Hi everyone,

 

I had to customize the OOTB index /oak:index/damAssetLucene-11 for AEMaaCS, adding stopwords, some new indexRules, and suggestions. I also added a tika/config.xml. You can find an example of my index below:

<damAssetLucene-11-custom-2
        jcr:primaryType="oak:QueryIndexDefinition"
        async="[async,nrt]"
        compatVersion="{Long}2"
        evaluatePathRestrictions="{Boolean}true"
        excludedPaths="[/some/path]"
        includedPaths="[/content/dam]"
        maxFieldLength="{Long}100000"
        tags="[visualSimilaritySearch,assetsOmnisearch]"
        type="lucene">
    <aggregates jcr:primaryType="nt:unstructured">
        ...
    </aggregates>
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
            <filters jcr:primaryType="nt:unstructured">
                <LowerCase jcr:primaryType="nt:unstructured"/>
                <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
                    <stopwords.txt jcr:primaryType="nt:file">
                        <jcr:content jcr:primaryType="nt:unstructured"/>
                    </stopwords.txt>
                </Stop>
            </filters>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured">
        ...
    </indexRules>
    <suggestion
            jcr:primaryType="nt:unstructured"
            suggestAnalyzed="{Boolean}true"
            suggestUpdateFrequencyMinutes="{Long}5"/>
    <tika jcr:primaryType="nt:unstructured">
        <config.xml jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
        </config.xml>
    </tika>
</damAssetLucene-11-custom-2>

However, with this index I can't search by PDF content: the queries return no results.

Locally, the index returns results if I remove tika/config.xml. After deployment to AEMaaCS, however, the index doesn't return PDF documents in the results.

Query example: /jcr:root/content/dam/project/en/sitecontent/documents//element(*, dam:Asset)[(jcr:contains(., 'some text in the pdf*'))]/rep:excerpt(.)

By the way, after the deployment to AEMaaCS, the /oak:index/damAssetLucene-11-custom-1 and /oak:index/damAssetLucene-11 indexes are still enabled.

Do you have any ideas about the potential root cause?

5 Replies

Community Advisor

Hi @konstantyn_diachenko ,

Can you validate your index definition using the link below?

https://oakutils.appspot.com/analyze/index

Thanks


Himanshu Jain

Community Advisor

@konstantyn_diachenko ,

Can you also share the content of the config.xml file under the tika node?

Thanks


Himanshu Jain

Hi @Himanshu_Jain , thank you for your answer. 

I use the tika/config.xml from /oak:index/damAssetLucene-11 on the AEMaaCS dev instance, and it's equivalent to the one in this documentation (https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/operations/index...):

<properties>
    <detectors>
        <detector class="org.apache.tika.detect.TypeDetector"/>
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime>text/plain</mime>
        </parser>
    </parsers>
    <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>
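
For anyone comparing their own tika/config.xml against this one, a quick way to see which MIME types each parser entry declares is to parse the file and list the `<mime>` children. This is only a minimal sketch in Python using the standard library. The helper name `parser_mime_types` is hypothetical, and the config is pasted inline as a string rather than read from the repository. It only reports what the XML declares; it does not tell you how Oak or AEMaaCS will actually apply the config.

```python
import xml.etree.ElementTree as ET

# The Tika config from this post, pasted inline for illustration.
TIKA_CONFIG = """<properties>
    <detectors>
        <detector class="org.apache.tika.detect.TypeDetector"/>
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime>text/plain</mime>
        </parser>
    </parsers>
    <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>"""


def parser_mime_types(config_xml: str) -> dict:
    """Map each <parser> class name to the list of <mime> types it declares.

    Hypothetical helper: an empty list means the parser entry has no
    <mime> children in the config.
    """
    root = ET.fromstring(config_xml)
    result = {}
    for parser in root.findall("./parsers/parser"):
        mimes = [(m.text or "").strip() for m in parser.findall("mime")]
        result[parser.get("class")] = mimes
    return result


if __name__ == "__main__":
    for cls, mimes in parser_mime_types(TIKA_CONFIG).items():
        print(cls, "->", mimes if mimes else "no <mime> entries declared")
```

For the config above, the only parser entry is org.apache.tika.parser.DefaultParser, and the only MIME type it declares is text/plain.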

 

I found the root cause: I had installed the assets via an AEM package created on an on-prem AEM instance. After reprocessing all assets, PDF content became searchable.

Administrator

@konstantyn_diachenko Did you find the suggestion helpful? Please let us know if you require more information. Otherwise, please mark the answer as correct for posterity. If you've discovered a solution yourself, we would appreciate it if you could share it with the community. Thank you!



Kautuk Sahni