Query Builder: Create custom index to find misspelled words

Level 1

I'm using the query builder to implement a search feature. 

I want to optimize the query/results to find misspelled words and phrases.


Example query:
path=/content/dam/tap/master/pt/pages
type=dam:Asset
property=jcr:content/data/cq:model
property.value=/conf/tap/settings/dam/cfm/models/page
fulltext=bagagem
p.limit=-1

The goal is for searches such as "luggage", "bagag", or "bagagm" to return practically the same results.

I know I can use fuzzy search, but it returns too many incorrect results.

To do this, I created a custom index containing only the properties I want to search, and added an analyzer with the PT dictionary, but it's not working.

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:dam="http://www.day.com/dam/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0" xmlns:oak="http://jackrabbit.apache.org/oak/ns/1.0" xmlns:rep="internal"
    jcr:mixinTypes="[rep:AccessControllable]"
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    includedPaths="[/content/dam/project/master/pt/pages]"
    queryPaths="[/content/dam/project/master/pt/pages]"
    maxFieldLength="{Long}100000"
    type="lucene"
    reindex="{Boolean}true">
    <indexRules jcr:primaryType="nt:unstructured">
        <dam:Asset jcr:primaryType="nt:unstructured">
            <properties jcr:primaryType="nt:unstructured">

               <title
                    jcr:primaryType="nt:unstructured"
                    _comment="title to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/title"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>

                <seoDescription
                    jcr:primaryType="nt:unstructured"
                    _comment="seoDescription to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/seoDescription"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>

                <customIndexedData
                    jcr:primaryType="nt:unstructured"
                    _comment="customIndexedData to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/customIndexedData"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>
            </properties>
        </dam:Asset>
    </indexRules>
    <analyzers jcr:primaryType="nt:unstructured">
        <default
            jcr:primaryType="nt:unstructured"
            class="org.apache.lucene.analysis.pt.PortugueseAnalyzer" />
    </analyzers>
</jcr:root>


What can I do to improve this feature?

 

2 Replies

Community Advisor

Hi @JoelSo3 ,

You need a combination of phonetic matching, n-gram tokenization, and fuzzy search tuning. Here’s the corrected and optimized approach for your AEM Lucene Index:

1. Update Your Index to Use a Better Analyzer

Your current PortugueseAnalyzer only handles stemming and stopwords, which is not enough for spell correction. Instead, use a combination of:

     - NGramFilterFactory (for partial word matching)

     - PhoneticFilterFactory (for phonetic similarity)

Update Your <analyzers> Section

<analyzers jcr:primaryType="nt:unstructured">
    <default jcr:primaryType="nt:unstructured">
        <!-- Oak looks up tokenizer and filter factories by short name; options are node properties, not child elements -->
        <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
        <!-- Filters are applied in node order -->
        <filters jcr:primaryType="nt:unstructured">
            <!-- Convert text to lowercase -->
            <LowerCase jcr:primaryType="nt:unstructured"/>

            <!-- Remove stopwords. Lucene has no Portuguese-specific stop filter factory;
                 the generic Stop filter uses English defaults unless a "words" file is configured -->
            <Stop jcr:primaryType="nt:unstructured"/>

            <!-- Generate phonetic representations for misspelled words
                 (assumes the Lucene phonetic analysis module is available to Oak) -->
            <Phonetic jcr:primaryType="nt:unstructured"
                encoder="DoubleMetaphone"
                inject="true"/>

            <!-- Improve matching for partial words -->
            <NGram jcr:primaryType="nt:unstructured"
                minGramSize="3"
                maxGramSize="6"/>

            <!-- No fuzzy filter here: fuzzy matching is applied at query time (see step 3) -->
        </filters>
    </default>
</analyzers>
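
To make the n-gram idea concrete, here is a minimal sketch in plain Python. It is not Oak or Lucene code, and the helper name `char_ngrams` is made up for illustration. It splits a term into character n-grams using the same sizes as above (3 to 6) and shows which grams a misspelled query shares with the indexed word. Lucene's real NGram filter also tracks token positions and offsets, which this sketch ignores.

```python
def char_ngrams(term, min_size=3, max_size=6):
    """Return the set of character n-grams of `term` with lengths min_size..max_size."""
    term = term.lower()
    grams = set()
    for size in range(min_size, max_size + 1):
        # Slide a window of `size` characters across the term.
        for start in range(len(term) - size + 1):
            grams.add(term[start:start + size])
    return grams


indexed = char_ngrams("bagagem")  # grams that would be stored for the indexed word

for query in ("bagagem", "bagag", "bagagm"):
    shared = char_ngrams(query) & indexed
    # A query can match as soon as at least one of its grams is also in the index.
    print(query, "shares", len(shared), "grams:", sorted(shared))
```

Both "bagag" and "bagagm" share grams such as "bag" and "bagag" with "bagagem", so they can match even though the whole word differs. The trade-off is a larger index and more loosely related hits, because any word containing "bag" or "aga" also shares grams.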

 

2. Add More Fields for Searching

In your <properties> section, make sure you index all necessary fields. Modify your XML like this:

<properties jcr:primaryType="nt:unstructured">
    <title jcr:primaryType="nt:unstructured"
           name="jcr:content/data/master/title"
           analyzed="{Boolean}true"
           nodeScopeIndex="{Boolean}true"
           propertyIndex="{Boolean}false"
           useInSuggest="{Boolean}true"/>

    <seoDescription jcr:primaryType="nt:unstructured"
                    name="jcr:content/data/master/seoDescription"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}true"/>

    <customIndexedData jcr:primaryType="nt:unstructured"
                       name="jcr:content/data/master/customIndexedData"
                       analyzed="{Boolean}true"
                       nodeScopeIndex="{Boolean}true"
                       propertyIndex="{Boolean}false"
                       useInSuggest="{Boolean}true"/>
</properties>

 

3. Use a More Precise Query for Fuzzy Matching

Instead of the default fulltext search, improve your query like this:

SELECT * FROM [dam:Asset] AS asset
WHERE ISDESCENDANTNODE(asset, "/content/dam/tap/master/pt/pages")
AND CONTAINS(asset.*, "bagagem~2")  /* The "~2" enables fuzzy search with 2 edits */
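
As a rough illustration of what "2 edits" means, the sketch below computes the plain Levenshtein distance in Python. This is a simplification and an assumption on my side: Lucene's fuzzy matching also counts a transposition of two adjacent characters as a single edit by default, which this helper (the hypothetical `edit_distance`) does not model.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


for candidate in ("bagagm", "bagag", "garagem"):
    print(candidate, edit_distance(candidate, "bagagem"))
```

"bagagm" is 1 edit away from "bagagem" and "bagag" is 2, but an unrelated word such as "garagem" is also only 2 edits away. That is why a limit of 2 can pull in incorrect results; a limit of 1 is stricter but would no longer match "bagag".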


Regards,
Amit

Level 1
<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:dam="http://www.day.com/dam/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0" xmlns:oak="http://jackrabbit.apache.org/oak/ns/1.0" xmlns:rep="internal"
    jcr:mixinTypes="[rep:AccessControllable]"
    jcr:primaryType="oak:QueryIndexDefinition"
    async="[async,nrt]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    includedPaths="[/content/dam/tap/master/pt/pages]"
    queryPaths="[/content/dam/tap/master/pt/pages]"
    maxFieldLength="{Long}100000"
    type="lucene"
    reindex="{Boolean}true">
    <indexRules jcr:primaryType="nt:unstructured">
        <dam:Asset jcr:primaryType="nt:unstructured">
            <properties jcr:primaryType="nt:unstructured">

               <title
                    jcr:primaryType="nt:unstructured"
                    _comment="title to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/title"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>

                <seoDescription
                    jcr:primaryType="nt:unstructured"
                    _comment="seoDescription to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/seoDescription"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>

                <customIndexedData
                    jcr:primaryType="nt:unstructured"
                    _comment="customIndexedData to be included in fulltext"
                    isRegexp="{Boolean}false"
                    name="jcr:content/data/master/customIndexedData"
                    analyzed="{Boolean}true"
                    nodeScopeIndex="{Boolean}true"
                    propertyIndex="{Boolean}false"
                    useInSuggest="{Boolean}false"/>

            </properties>



        </dam:Asset>
    </indexRules>

    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer jcr:primaryType="nt:unstructured" class="org.apache.lucene.analysis.standard.StandardTokenizerFactory"/>
            <filters jcr:primaryType="nt:unstructured">
                <!-- Convert text to lowercase -->
                <filter1 jcr:primaryType="nt:unstructured" class="org.apache.lucene.analysis.core.LowerCaseFilterFactory"/>
               
                <!-- Remove common Portuguese stopwords -->
                <filter2 jcr:primaryType="nt:unstructured" class="org.apache.lucene.analysis.pt.PortugueseStopFilterFactory"/>
               
                <filter3
                    jcr:primaryType="nt:unstructured"
                    class="org.apache.lucene.analysis.phonetic.PhoneticFilterFactory"
                    encoder="DoubleMetaphone"
                    inject="{Boolean}true"/>

                <!-- Improve matching for partial words -->
                <filter4
                    jcr:primaryType="nt:unstructured"
                    class="org.apache.lucene.analysis.ngram.NGramFilterFactory"
                    minGramSize="{Long}3"
                    maxGramSize="{Long}6"/>

                <!-- Enable fuzzy search for better misspelling detection -->
                <filter5
                    jcr:primaryType="nt:unstructured"
                    class="org.apache.lucene.analysis.miscellaneous.FuzzyQueryFactory"
                    maxEdits="{Long}2"
                    prefixLength="{Long}1"
                    transpositions="{Boolean}true"/>

            </filters>
        </default>
    </analyzers>
         
   
</jcr:root>


I made the changes to my index, but I still don't get any results when the word is misspelled.

path=/content/dam/tap/master/pt/pages
type=dam:Asset
property=jcr:content/data/cq:model
property.value=/conf/tap/settings/dam/cfm/models/page
fulltext=bagagm
p.limit=-1

I don't want to add the fuzzy search option to the query.

If I search for "bagagem" I get 45 results, but if I search for "bagagm" I get 0.