AEMaaCS customized damAssetLucene-11 index doesn't return results from PDFs

Hi everyone,

 

I had to customize the OOTB index /oak:index/damAssetLucene-11 for AEMaaCS with stopwords, some new indexRules, and suggestions. I based my customization on /oak:index/damAssetLucene-11 and also added tika/config.xml. An example of my index is below:

<damAssetLucene-11-custom-2
        jcr:primaryType="oak:QueryIndexDefinition"
        async="[async,nrt]"
        compatVersion="{Long}2"
        evaluatePathRestrictions="{Boolean}true"
        excludedPaths="[/some/path]"
        includedPaths="[/content/dam]"
        maxFieldLength="{Long}100000"
        tags="[visualSimilaritySearch,assetsOmnisearch]"
        type="lucene">
    <aggregates jcr:primaryType="nt:unstructured">
        ...
    </aggregates>
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
            <filters jcr:primaryType="nt:unstructured">
                <LowerCase jcr:primaryType="nt:unstructured"/>
                <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
                    <stopwords.txt jcr:primaryType="nt:file">
                        <jcr:content jcr:primaryType="nt:unstructured"/>
                    </stopwords.txt>
                </Stop>
            </filters>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured">
        ...
    </indexRules>
    <suggestion
            jcr:primaryType="nt:unstructured"
            suggestAnalyzed="{Boolean}true"
            suggestUpdateFrequencyMinutes="{Long}5"/>
    <tika jcr:primaryType="nt:unstructured">
        <config.xml jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
        </config.xml>
    </tika>
</damAssetLucene-11-custom-2>

However, with this index I can't search by PDF content: queries return no results.

Locally, if I remove tika/config.xml, the index returns results. After deployment to AEMaaCS, the index doesn't return PDF documents in the results.

Query example: /jcr:root/content/dam/project/en/sitecontent/documents//element(*, dam:Asset)[(jcr:contains(., 'some text in the pdf*'))]/rep:excerpt(.)

By the way, after the deployment to AEMaaCS, the /oak:index/damAssetLucene-11-custom-1 and /oak:index/damAssetLucene-11 indexes are both still enabled.

 

Do you have any ideas about the potential root cause?

5 Replies

Community Advisor

Hi @konstantyn_diachenko ,

Can you validate your index definition using the link below?

https://oakutils.appspot.com/analyze/index

Thanks,

Himanshu Jain

Community Advisor

@konstantyn_diachenko ,

Can you also share the contents of the config.xml file under the tika node?

Thanks,

Himanshu Jain

Hi @Himanshu_Jain, thank you for your answer.

I use the tika/config.xml from /oak:index/damAssetLucene-11 on the AEMaaCS dev instance, and it's equivalent to the one in this documentation: https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/operations/index...

<properties>
    <detectors>
        <detector class="org.apache.tika.detect.TypeDetector"/>
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime>text/plain</mime>
        </parser>
    </parsers>
    <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>
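
As a quick sanity check on what this config allows, here is a small Python sketch that lists the MIME types declared under each parser. The helper name `parser_mime_types` is my own and not part of AEM, Oak, or Tika. It assumes that the `<mime>` children restrict the parser to exactly those types; under that assumption, Tika would only handle text/plain at index time, so PDF text would presumably have to come from somewhere other than this parser.

```python
import xml.etree.ElementTree as ET

# The tika/config.xml content from this post, inlined for the example.
TIKA_CONFIG = """<properties>
    <detectors>
        <detector class="org.apache.tika.detect.TypeDetector"/>
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime>text/plain</mime>
        </parser>
    </parsers>
    <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>"""


def parser_mime_types(config_xml):
    """Map each <parser> class name to the MIME types listed under it.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(config_xml)
    return {
        parser.get("class"): [m.text.strip() for m in parser.findall("mime")]
        for parser in root.findall("./parsers/parser")
    }


mimes = parser_mime_types(TIKA_CONFIG)
for parser_class, types in mimes.items():
    print(parser_class, "->", types)

# True if any parser in the config explicitly lists application/pdf.
pdf_listed = any("application/pdf" in types for types in mimes.values())
print("application/pdf listed:", pdf_listed)
```

Running the same check against a locally modified config.xml is an easy way to see whether the local and cloud setups differ in which types the parser is restricted to.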

 

I found the root cause: I had installed the assets via an AEM package created on an on-prem AEM instance. After I reprocessed all the assets, their PDF content became searchable.

Administrator

@konstantyn_diachenko Did you find the suggestion helpful? Please let us know if you require more information. Otherwise, please mark the answer as correct for posterity. If you've discovered a solution yourself, we would appreciate it if you could share it with the community. Thank you!



Kautuk Sahni