Adobe Experience Manager Sites & More

konstantyn_diachenko · 8/29/24

Hi everyone,

I had to customize the OOTB index /oak:index/damAssetLucene-11 with stopwords, some new indexRules, and suggestions for AEMaaCS. I based my customization on /oak:index/damAssetLucene-11 and added a tika/config.xml as well. An example of my index is below:

<damAssetLucene-11-custom-2
        jcr:primaryType="oak:QueryIndexDefinition"
        async="[async,nrt]"
        compatVersion="{Long}2"
        evaluatePathRestrictions="{Boolean}true"
        excludedPaths="[/some/path]"
        includedPaths="[/content/dam]"
        maxFieldLength="{Long}100000"
        tags="[visualSimilaritySearch,assetsOmnisearch]"
        type="lucene">
    <aggregates jcr:primaryType="nt:unstructured">
        ...
    </aggregates>
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
            <filters jcr:primaryType="nt:unstructured">
                <LowerCase jcr:primaryType="nt:unstructured"/>
                <Stop jcr:primaryType="nt:unstructured" words="[stopwords.txt]">
                    <stopwords.txt jcr:primaryType="nt:file">
                        <jcr:content jcr:primaryType="nt:unstructured"/>
                    </stopwords.txt>
                </Stop>
            </filters>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured">
        ...
    </indexRules>
    <suggestion
            jcr:primaryType="nt:unstructured"
            suggestAnalyzed="{Boolean}true"
            suggestUpdateFrequencyMinutes="{Long}5"/>
    <tika jcr:primaryType="nt:unstructured">
        <config.xml jcr:primaryType="nt:file">
            <jcr:content jcr:primaryType="nt:unstructured"/>
        </config.xml>
    </tika>
</damAssetLucene-11-custom-2>

However, with this index I can't search by PDF content. Queries return no results.

Locally, if I remove tika/config.xml, the index returns results. After deployment to AEMaaCS, the index doesn't return PDF documents in the results.

Query example: /jcr:root/content/dam/project/en/sitecontent/documents//element(*, dam:Asset)[(jcr:contains(., 'some text in the pdf*'))]/rep:excerpt(.)

By the way, after the deployment to AEMaaCS, the /oak:index/damAssetLucene-11-custom-1 and /oak:index/damAssetLucene-11 indexes are still enabled.

Do you have any ideas about the potential root cause?
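
To keep track of which definition is the newest when several versions exist side by side, here is a minimal Python sketch. It assumes the index names follow the pattern `<baseName>-<productVersion>-custom-<customVersion>`, as the names in this thread do. The helper names `parse_index_name` and `latest` are hypothetical, and the sketch only compares names; it says nothing about which index the query engine actually selects.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Assumed naming pattern: <baseName>-<productVersion>[-custom-<customVersion>]
INDEX_NAME = re.compile(
    r"^(?P<base>.+?)-(?P<product>\d+)(?:-custom-(?P<custom>\d+))?$"
)


class IndexVersion(NamedTuple):
    base: str
    product: int
    custom: int  # 0 stands for the out-of-the-box definition


def parse_index_name(path: str) -> Optional[IndexVersion]:
    """Split an index path or name into base name and version numbers."""
    name = path.rsplit("/", 1)[-1]
    match = INDEX_NAME.match(name)
    if match is None:
        return None
    return IndexVersion(
        match["base"], int(match["product"]), int(match["custom"] or 0)
    )


def latest(paths: Iterable[str]) -> str:
    """Return the path with the highest (product, custom) version pair.

    Assumes all paths share the same base name.
    """
    parsed = [(parse_index_name(p), p) for p in paths]
    versioned = [(v, p) for v, p in parsed if v is not None]
    return max(versioned, key=lambda vp: (vp[0].product, vp[0].custom))[1]


indexes = [
    "/oak:index/damAssetLucene-11",
    "/oak:index/damAssetLucene-11-custom-1",
    "/oak:index/damAssetLucene-11-custom-2",
]
print(latest(indexes))
```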

Kostiantyn Diachenko

Check out AEM VLT Intellij plugin

Himanshu_Jain · 8/29/24

Hi @konstantyn_diachenko ,

Can you validate your index definition using the link below?

https://oakutils.appspot.com/analyze/index

Thanks

Himanshu Jain

Himanshu_Jain · 8/30/24

@konstantyn_diachenko ,

Can you also share the content of the config.xml file under the tika node?

Thanks

Himanshu Jain

konstantyn_diachenko · 8/30/24

Hi @Himanshu_Jain, thank you for your answer.

I use the tika/config.xml from /oak:index/damAssetLucene-11 on the AEMaaCS dev instance, and it's equivalent to the one in this documentation (https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/operations/index...):

<properties>
    <detectors>
        <detector class="org.apache.tika.detect.TypeDetector"/>
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime>text/plain</mime>
        </parser>
    </parsers>
    <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>
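
For anyone who wants to check which MIME types a config like this lets through, here is a minimal Python sketch. It relies on a simplified reading of Tika's config format, which is an assumption of the sketch: a parser with `<mime>` children handles only those types, and a parser without them handles everything except its `<mime-exclude>` types. The helper name `parsers_for` is hypothetical. The sketch only inspects the file; it doesn't show how AEMaaCS actually extracts PDF text.

```python
import xml.etree.ElementTree as ET
from typing import List

# The config.xml quoted in this post.
TIKA_CONFIG = """\
<properties>
    <detectors>
        <detector class="org.apache.tika.detect.TypeDetector"/>
    </detectors>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime>text/plain</mime>
        </parser>
    </parsers>
    <service-loader initializableProblemHandler="ignore" dynamic="true"/>
</properties>
"""


def parsers_for(config_xml: str, mime_type: str) -> List[str]:
    """Return the parser classes whose rules admit mime_type.

    Simplified reading (assumption): <mime> children act as an allow-list,
    <mime-exclude> children act as a deny-list.
    """
    root = ET.fromstring(config_xml)
    admitted = []
    for parser in root.findall("./parsers/parser"):
        only = [m.text.strip() for m in parser.findall("mime") if m.text]
        excluded = [
            m.text.strip() for m in parser.findall("mime-exclude") if m.text
        ]
        if mime_type in excluded:
            continue
        if only and mime_type not in only:
            continue
        admitted.append(parser.get("class", ""))
    return admitted


print(parsers_for(TIKA_CONFIG, "text/plain"))
print(parsers_for(TIKA_CONFIG, "application/pdf"))
```

Under that reading, the second call returns an empty list: no parser declared in this file is configured for application/pdf.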

Kostiantyn Diachenko


konstantyn_diachenko · 9/4/24

I found the root cause. I had installed the assets from an AEM package created on an on-prem AEM instance. After I reprocessed all the assets, PDF content became searchable.

Kostiantyn Diachenko


kautuk_sahni · 9/4/24

@konstantyn_diachenko Did you find the suggestion helpful? Please let us know if you require more information. Otherwise, please mark the answer as correct for posterity. If you've discovered a solution yourself, we would appreciate it if you could share it with the community. Thank you!

AEMaaCS customized damAssetLucene-11 index doesn't return results from PDFs
