Adobe Experience Manager Sites & More

JoelSo3 · 3/26/25

I'm using the query builder to implement a search feature.

I want to optimize the query/results to find misspelled words and phrases.

Example query:
path=/content/dam/tap/master/pt/pages
type=dam:Asset
property=jcr:content/data/cq:model
property.value=/conf/tap/settings/dam/cfm/models/page
fulltext=bagagem
p.limit=-1

The goal is that searching for "luggage", "bagag", or "bagagm" returns practically the same results.

I know I can use fuzzy search, but it returns too many incorrect results.

To do this, I created a custom index that contains only the properties I want to search and added an analyzer for Portuguese (PT), but it's not working.

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:dam="http://www.day.com/dam/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0" xmlns:oak="http://jackrabbit.apache.org/oak/ns/1.0" xmlns:rep="internal"
    jcr:mixinTypes="[rep:AccessControllable]"
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    includedPaths="[/content/dam/project/master/pt/pages]"
    queryPaths="[/content/dam/project/master/pt/pages]"
    maxFieldLength="{Long}100000"
    type="lucene"
    reindex="{Boolean}true">
    <indexRules jcr:primaryType="nt:unstructured">
        <dam:Asset jcr:primaryType="nt:unstructured">
            <properties jcr:primaryType="nt:unstructured">
                <title
                    jcr:primaryType="nt:unstructured"
                    _comment="title to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/title"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>
                <seoDescription
                    jcr:primaryType="nt:unstructured"
                    _comment="seoDescription to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/seoDescription"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>
                <customIndexedData
                    jcr:primaryType="nt:unstructured"
                    _comment="customIndexedData to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/customIndexedData"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>
            </properties>
        </dam:Asset>
    </indexRules>
    <analyzers jcr:primaryType="nt:unstructured">
        <default
            jcr:primaryType="nt:unstructured"
            class="org.apache.lucene.analysis.pt.PortugueseAnalyzer"/>
    </analyzers>
</jcr:root>

What can I do to improve this feature?

AmitVishwakarma · 3/26/25

Hi @JoelSo3 ,

You need a combination of phonetic matching, n-gram tokenization, and fuzzy search tuning. Here’s the corrected and optimized approach for your AEM Lucene Index:

1. Update Your Index to Use a Better Analyzer

Your current PortugueseAnalyzer only handles stemming and stopwords, which is not enough for spell correction. Instead, use a combination of:

- NGramFilterFactory (for partial word matching)

- PhoneticFilterFactory (for phonetic similarity)
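
To see why n-grams help with partial or misspelled words, here is a minimal Python sketch (not Lucene code) that splits a term into character n-grams of lengths 3 to 6, the sizes suggested in this answer, and lists the grams a mistyped word shares with the correct one. The function names are illustrative assumptions, and the real filter's output and options may differ.

```python
def char_ngrams(term: str, min_size: int = 3, max_size: int = 6) -> set:
    """Return every substring of `term` whose length is between min_size and max_size."""
    grams = set()
    for size in range(min_size, max_size + 1):
        # No grams are produced for sizes longer than the term itself.
        for start in range(len(term) - size + 1):
            grams.add(term[start:start + size])
    return grams


def shared_ngrams(indexed: str, typed: str) -> set:
    """Grams that the indexed word and the typed word have in common."""
    return char_ngrams(indexed) & char_ngrams(typed)


if __name__ == "__main__":
    for typed in ("bagagm", "bagag"):
        print(typed, sorted(shared_ngrams("bagagem", typed)))
```

Both "bagagm" and "bagag" share grams such as "bag", "baga", and "bagag" with "bagagem", which is what lets an n-gram index match them. The likely trade-off is a larger index and looser matches.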

Update Your <analyzers> Section

<analyzers jcr:primaryType="nt:unstructured">
    <default jcr:primaryType="nt:unstructured">
        <tokenizer class="org.apache.lucene.analysis.standard.StandardTokenizerFactory"/>
        <filters jcr:primaryType="nt:unstructured">
            <!-- Convert text to lowercase -->
            <filter class="org.apache.lucene.analysis.core.LowerCaseFilterFactory"/>
            
            <!-- Remove common Portuguese stopwords -->
            <filter class="org.apache.lucene.analysis.pt.PortugueseStopFilterFactory"/>
            
            <!-- Generate phonetic representations for misspelled words -->
            <filter class="org.apache.lucene.analysis.phonetic.PhoneticFilterFactory">
                <encoder>DoubleMetaphone</encoder>
                <inject>true</inject>
            </filter>

            <!-- Improve matching for partial words -->
            <filter class="org.apache.lucene.analysis.ngram.NGramFilterFactory">
                <minGramSize>3</minGramSize>
                <maxGramSize>6</maxGramSize>
            </filter>

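            <!-- Intended to enable fuzzy matching. Caution: Lucene normally applies fuzzy
                 matching at query time (FuzzyQuery), so confirm that this factory class
                 exists in your Lucene version before relying on it -->
            <filter class="org.apache.lucene.analysis.miscellaneous.FuzzyQueryFactory">
                <maxEdits>2</maxEdits> <!-- Maximum edit distance; higher values match more loosely -->
                <prefixLength>1</prefixLength>
                <transpositions>true</transpositions>
            </filter>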
        </filters>
    </default>
</analyzers>
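
A phonetic filter reduces each word to a code based on how it sounds, so that differently spelled words can collapse to the same indexed term. Double Metaphone is too long to reproduce here, so the Python sketch below swaps in the much simpler American Soundex algorithm purely as an illustration. It is not what the filter above computes, and Soundex was designed for English names, so its behavior on Portuguese words is only indicative.

```python
# Soundex digit for each consonant; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of an alphabetic word."""
    word = word.lower()
    if not word:
        return ""
    encoded = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            # h and w do not separate two consonants that share a digit.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated digit is kept
    return (encoded + "000")[:4]


if __name__ == "__main__":
    for term in ("bagagem", "bagagm", "bagag"):
        print(term, soundex(term))
```

With this stand-in, "bagagem" and "bagagm" receive the same code while "bagag" gets a different one, which hints at both the benefit and the limits of phonetic matching for truncated words.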

2. Add More Fields for Searching

In your <properties> section, make sure you index all necessary fields. Modify your XML like this:

<properties jcr:primaryType="nt:unstructured">
    <title jcr:primaryType="nt:unstructured"
           name="jcr:content/data/master/title"
           analyzed="{Boolean}true"
           nodeScopeIndex="{Boolean}true"
           propertyIndex="{Boolean}false"
           useInSuggest="{Boolean}true"/>

    <seoDescription jcr:primaryType="nt:unstructured"
                    name="jcr:content/data/master/seoDescription"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}true"/>

    <customIndexedData jcr:primaryType="nt:unstructured"
                       name="jcr:content/data/master/customIndexedData"
                       analyzed="{Boolean}true"
                       nodeScopeIndex="{Boolean}true"
                       propertyIndex="{Boolean}false"
                       useInSuggest="{Boolean}true"/>
</properties>

3. Use a More Precise Query for Fuzzy Matching

Instead of default fulltext search, improve your query like this:

SELECT * FROM [dam:Asset] AS asset
WHERE ISDESCENDANTNODE(asset, "/content/dam/tap/master/pt/pages")
AND CONTAINS(asset.*, "bagagem~2")  /* The "~2" enables fuzzy search with 2 edits */
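
The "~2" suffix asks for terms within two edits of the search word. As a rough illustration of what that distance means, the Python sketch below computes the plain Levenshtein distance. It is a simplification (a transposition counts as two edits here), and the function name is an assumption, not part of any AEM or Lucene API.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for typed in ("bagagm", "bagag", "luggage"):
        print(typed, levenshtein("bagagem", typed))
```

Under this measure "bagagm" is one edit and "bagag" two edits away from "bagagem", so both fall inside the "~2" window. "luggage" is more than two edits away, so a translation like that would probably need a different mechanism, such as synonyms.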

Regards,
Amit

JoelSo3 · 3/27/25

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:dam="http://www.day.com/dam/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0" xmlns:oak="http://jackrabbit.apache.org/oak/ns/1.0" xmlns:rep="internal"
    jcr:mixinTypes="[rep:AccessControllable]"
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    includedPaths="[/content/dam/tap/master/pt/pages]"
    queryPaths="[/content/dam/tap/master/pt/pages]"
    maxFieldLength="{Long}100000"
    type="lucene"
    reindex="{Boolean}true">
    <indexRules jcr:primaryType="nt:unstructured">
        <dam:Asset jcr:primaryType="nt:unstructured">
            <properties jcr:primaryType="nt:unstructured">
                <title
                    jcr:primaryType="nt:unstructured"
                    _comment="title to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/title"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>
                <seoDescription
                    jcr:primaryType="nt:unstructured"
                    _comment="seoDescription to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/seoDescription"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>
                <customIndexedData
                    jcr:primaryType="nt:unstructured"
                    _comment="customIndexedData to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/customIndexedData"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>
            </properties>
        </dam:Asset>
    </indexRules>
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <filters jcr:primaryType="nt:unstructured">
                <filter3
                    jcr:primaryType="nt:unstructured"
                    class="org.apache.lucene.analysis.phonetic.PhoneticFilterFactory"
                    encoder="DoubleMetaphone"
                    inject="{Boolean}true"/>
                <filter4
                    jcr:primaryType="nt:unstructured"
                    class="org.apache.lucene.analysis.ngram.NGramFilterFactory"
                    minGramSize="{Long}3"
                    maxGramSize="{Long}6"/>
                <filter5
                    jcr:primaryType="nt:unstructured"
                    class="org.apache.lucene.analysis.miscellaneous.FuzzyQueryFactory"
                    maxEdits="{Long}2"
                    prefixLength="{Long}1"
                    transpositions="{Boolean}true"/>
            </filters>
        </default>
    </analyzers>
</jcr:root>

I made the changes to my index, but I still don't get results when the word is misspelled.

path=/content/dam/tap/master/pt/pages
type=dam:Asset
property=jcr:content/data/cq:model
property.value=/conf/tap/settings/dam/cfm/models/page
fulltext=bagagm
p.limit=-1

I don't want to add the fuzzy search option to the query.

If I search for "bagagem" I get 45 results, but if I search for "bagagm" I get 0.

Query Builder: Create custom index to find misspelled words
