Hi Everyone,
I'm currently adding a custom stopwords.txt file to my custom Lucene index in AEM to filter out common stopwords during search. While most of the words in my list are being excluded as expected, I've noticed that some very common ones like "are", "was", and "that" are still being indexed and returned in search results.
My stopwords.txt file includes all of these terms (one per line), and I’ve confirmed the file is correctly referenced in the analyzer configuration for the index.
I’m wondering if anyone else has experienced this issue? Is there anything I might be missing related to:
File encoding or formatting of the stopwords.txt file?
Analyzer or tokenizer order/configuration?
Case sensitivity issues even with ignoreCase = true?
Possible overrides by other filters or analyzers?
Any suggestions or shared experiences would be greatly appreciated!
Thanks in advance!
Only a few words are still being indexed; the rest work as expected. Here is my XML file:
Verify your custom index's analyzers and filters in your oak:index definition.
The StopFilterFactory must be placed after any lowercasing filter if ignoreCase="true".
It should look something like this:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
From the details you posted, it looks like LowerCaseFilterFactory is missing.
Make sure to trigger a reindex after the changes.
Hi Nilesh,
Thanks for replying. I just tried adding ignoreCase = true, but those words are still being indexed.
<Stop
jcr:primaryType="nt:unstructured"
ignoreCase="{Boolean}true"
words="stopwords.txt">
<stopwords.txt/>
</Stop>
<WhitespaceTokenizerFactory />
This tokenizer is very basic: it only splits on whitespace and does not normalize punctuation, possessives, or word boundaries the way StandardTokenizerFactory does. A token such as "are," keeps its trailing comma, so it will not match the stop word "are".
<tokenizer jcr:primaryType="nt:unstructured" name="Standard">
<StandardTokenizerFactory jcr:primaryType="nt:unstructured"/>
</tokenizer>
Update the stop filter to:
<Stop
jcr:primaryType="nt:unstructured"
ignoreCase="{Boolean}true"
words="stopwords.txt">
<stopwords.txt/>
</Stop>
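Putting the pieces from this thread together, the analyzer node could look roughly like the sketch below. This is only an illustration, assuming the analyzer is defined under analyzers/default of the oak:index definition and follows Oak's composed-analyzer layout (a tokenizer node with a name property, and an ordered filters node); the surrounding node names are assumptions, not taken from the original index.

```xml
<!-- Sketch only: assumed to sit inside the custom oak:index definition node. -->
<analyzers jcr:primaryType="nt:unstructured">
    <default jcr:primaryType="nt:unstructured">
        <!-- Standard tokenizer instead of the whitespace tokenizer -->
        <tokenizer
            jcr:primaryType="nt:unstructured"
            name="Standard"/>
        <!-- Filters are applied in node order: lowercase first, then stop words -->
        <filters jcr:primaryType="nt:unstructured">
            <LowerCase jcr:primaryType="nt:unstructured"/>
            <Stop
                jcr:primaryType="nt:unstructured"
                ignoreCase="{Boolean}true"
                words="stopwords.txt">
                <!-- stopwords.txt is stored as a file node under the Stop filter -->
                <stopwords.txt/>
            </Stop>
        </filters>
    </default>
</analyzers>
```

The order of the child nodes under filters matters, so LowerCase is listed before Stop. A reindex is still needed after changing any of this.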
Thanks for the updates, but unfortunately no luck. By the way, uppercase "THAT" and "ARE" are not indexed, so I really have no idea what is happening here.
@DennisWa Just checking in — were you able to resolve your issue?
We’d love to hear how things worked out. If the suggestions above helped, marking a response as correct can guide others with similar questions. And if you found another solution, feel free to share it — your insights could really benefit the community. Thanks again for being part of the conversation!