I'm using the query builder to implement a search feature.
I want to optimize the query/results to find misspelled words and phrases.
Example query:
path=/content/dam/tap/master/pt/pages
type=dam:Asset
property=jcr:content/data/cq:model
property.value=/conf/tap/settings/dam/cfm/models/page
fulltext=bagagem
p.limit=-1
The goal is that searching for "luggage", "bagag", or "bagagm" returns practically the same results.
I know I can use fuzzy search, but it returns too many incorrect results.
To do this, I created a custom index limited to the properties I want to search and added an analyzer with the Portuguese (PT) dictionary, but it's not working.
Hi @JoelSo3,
You need a combination of phonetic matching, n-gram tokenization, and fuzzy search tuning. Here’s the corrected and optimized approach for your AEM Lucene Index:
1. Update Your Index to Use a Better Analyzer
Your current PortugueseAnalyzer only handles stemming and stopwords, which is not enough for spell correction. Instead, use a combination of:
- NGramFilterFactory (for partial word matching)
- PhoneticFilterFactory (for phonetic similarity)
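To see why n-grams help with partial or misspelled words, here is a minimal sketch in plain Python. It is independent of AEM and Lucene and only mimics the idea of character n-grams with sizes 3 to 6; the function names `char_ngrams` and `shared_ngrams` are made up for illustration.

```python
def char_ngrams(word, min_size=3, max_size=6):
    """Return the set of character n-grams of a lowercased word."""
    word = word.lower()
    grams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(word) - size + 1):
            grams.add(word[start:start + size])
    return grams


def shared_ngrams(indexed_term, query_term):
    """N-grams that the indexed term and the query term have in common."""
    return char_ngrams(indexed_term) & char_ngrams(query_term)


if __name__ == "__main__":
    # Compare the indexed word with two misspellings and one unrelated word.
    for query in ("bagag", "bagagm", "viagem"):
        print(query, sorted(shared_ngrams("bagagem", query)))
```

Both "bagag" and "bagagm" share several n-grams with "bagagem", which is what makes them match. Note that an unrelated word such as "viagem" also shares a few n-grams ("age", "gem", "agem"), so small gram sizes can add noise of their own.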
Update Your <analyzers> Section
<analyzers jcr:primaryType="nt:unstructured">
    <default jcr:primaryType="nt:unstructured">
        <!-- Oak resolves tokenizers and filters by factory name, not by a class attribute -->
        <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
        <filters jcr:primaryType="nt:unstructured">
            <!-- Convert text to lowercase -->
            <LowerCase jcr:primaryType="nt:unstructured"/>
            <!-- Remove common Portuguese stopwords; stopwords_pt.txt is a placeholder name
                 for a word list you must supply, to be added as an nt:file child of this node -->
            <Stop jcr:primaryType="nt:unstructured" words="stopwords_pt.txt"/>
            <!-- Generate phonetic representations for misspelled words; this assumes the
                 Lucene phonetic analysis module is available to Oak on your instance -->
            <Phonetic jcr:primaryType="nt:unstructured" encoder="DoubleMetaphone" inject="true"/>
            <!-- Improve matching for partial words -->
            <NGram jcr:primaryType="nt:unstructured" minGramSize="3" maxGramSize="6"/>
            <!-- Fuzzy matching is not an analyzer filter; it is applied at query time (see step 3) -->
        </filters>
    </default>
</analyzers>
2. Add More Fields for Searching
In your <properties> section, make sure you index all necessary fields. Modify your XML like this:
<properties jcr:primaryType="nt:unstructured">
    <title jcr:primaryType="nt:unstructured"
        name="jcr:content/data/master/title"
        analyzed="{Boolean}true"
        nodeScopeIndex="{Boolean}true"
        propertyIndex="{Boolean}false"
        useInSuggest="{Boolean}true"/>
    <seoDescription jcr:primaryType="nt:unstructured"
        name="jcr:content/data/master/seoDescription"
        analyzed="{Boolean}true"
        nodeScopeIndex="{Boolean}true"
        propertyIndex="{Boolean}false"
        useInSuggest="{Boolean}true"/>
    <customIndexedData jcr:primaryType="nt:unstructured"
        name="jcr:content/data/master/customIndexedData"
        analyzed="{Boolean}true"
        nodeScopeIndex="{Boolean}true"
        propertyIndex="{Boolean}false"
        useInSuggest="{Boolean}true"/>
</properties>
3. Use a More Precise Query for Fuzzy Matching
Instead of the default full-text search, use a fuzzy query like this:
SELECT * FROM [dam:Asset] AS asset
WHERE ISDESCENDANTNODE(asset, '/content/dam/tap/master/pt/pages')
  AND CONTAINS(asset.*, 'bagagem~2') /* The "~2" enables fuzzy search with 2 edits */
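To get a feel for what a limit of 2 edits lets through, here is a small sketch of classic Levenshtein distance in plain Python. It is only an illustration and not how AEM or Lucene computes matches internally (Lucene's fuzzy matching can also count a transposition as a single edit, which this sketch does not); `edit_distance` is a name made up for the example.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Two misspellings and two unrelated Portuguese words.
    for candidate in ("bagagm", "bagag", "garagem", "viagem"):
        print(candidate, edit_distance("bagagem", candidate))
```

"bagagm" is 1 edit away and "bagag" is 2, so both fall within the limit. The unrelated word "garagem" is also only 2 edits away, which is one reason a limit of 2 can return too many incorrect results; a limit of 1 would still catch "bagagm" but not "bagag".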
Regards,
Amit