Stopwords Not Working as Expected in Custom Lucene Index (e.g. "are", "was", "that") | Community
July 3, 2025
Question

Stopwords Not Working as Expected in Custom Lucene Index (e.g. "are", "was", "that")


Hi Everyone,

I'm currently adding a custom stopwords.txt file to my custom Lucene index in AEM to filter out common stopwords during search. While most of the words in my list are being excluded as expected, I've noticed that some very common ones like "are", "was", and "that" are still being indexed and returned in search results.

My stopwords.txt file includes all of these terms (one per line), and I’ve confirmed the file is correctly referenced in the analyzer configuration for the index.

I’m wondering if anyone else has experienced this issue? Is there anything I might be missing related to:

  • File encoding or formatting of the stopwords.txt file?

  • Analyzer or tokenizer order/configuration?

  • Case sensitivity issues even with ignoreCase = true?

  • Possible overrides by other filters or analyzers?
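To rule out the first item, the raw bytes of the file can be inspected before looking at the analyzer itself. The following is a minimal sketch, not part of AEM or Lucene: `check_stopwords_bytes` is a hypothetical helper, and whether any of the conditions it reports actually affects the stop filter depends on how the analyzer loads the word list.

```python
import codecs


def check_stopwords_bytes(raw: bytes) -> list[str]:
    """Return human-readable notes about suspicious bytes in a stopwords file."""
    problems = []
    # A byte order mark would be glued to the first word if the loader does not strip it.
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 byte order mark")
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    for number, line in enumerate(text.split("\n"), start=1):
        word = line.rstrip("\r")
        if word != line:
            problems.append(f"line {number}: Windows (CRLF) line ending")
        if word != word.strip():
            problems.append(f"line {number}: leading or trailing whitespace around {word.strip()!r}")
        if any(ord(ch) > 127 for ch in word):
            problems.append(f"line {number}: non-ASCII character in {word!r}")
    return problems
```

It can be run against the file with `check_stopwords_bytes(open("stopwords.txt", "rb").read())`; an empty list means none of these conditions were found.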

Any suggestions or shared experiences would be greatly appreciated!

Thanks in advance!

3 replies

DennisWa
Author
July 3, 2025

Only a few words are still being indexed; the rest work as expected. Here is my XML file:

 

Nilesh_Mali
Level 3
July 3, 2025

@denniswa 

 

Verify the analyzers and filters in your custom index's oak:index definition.

The StopFilterFactory should come after LowerCaseFilterFactory in the filter chain. Without ignoreCase="true", the stop filter matches tokens only in the exact case listed in stopwords.txt.

 

It should be something like this:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
</analyzer>

 

From your details, it looks like you are missing LowerCaseFilterFactory.

Make sure to trigger a reindex after the changes.
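
To see how filter order and ignoreCase interact, here is a small plain-Python model of the chain. It is only an illustration, not Lucene or Oak code: `lowercase`, `stop`, and the three-word `STOPWORDS` set are stand-ins assumed for the example.

```python
# Assumed lowercase entries, as in a typical stopwords.txt.
STOPWORDS = {"are", "was", "that"}


def lowercase(tokens):
    """Stand-in for a lowercase filter."""
    return [t.lower() for t in tokens]


def stop(tokens, ignore_case=False):
    """Stand-in for a stop filter; matches exact case unless ignore_case is set."""
    if ignore_case:
        return [t for t in tokens if t.lower() not in STOPWORDS]
    return [t for t in tokens if t not in STOPWORDS]


tokens = "That was easy".split()

# Case-sensitive stop filter placed before lowercasing: "That" slips through.
stop_then_lower = lowercase(stop(tokens))

# Lowercasing first: every token is compared in the same case as the list.
lower_then_stop = stop(lowercase(tokens))

# Case-insensitive stop filter: order no longer matters for matching.
stop_ignore_case = stop(tokens, ignore_case=True)
```

In this model only the first arrangement leaves a stopword behind (`["that", "easy"]`); the other two both reduce the input to `["easy"]`.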

 

DennisWa
Author
July 3, 2025

Hi Nilesh,

 

Thanks for replying. I just tried adding ignoreCase = true, but they are still being indexed.

 
<Stop jcr:primaryType="nt:unstructured" ignoreCase="{Boolean}true" words="stopwords.txt">
  <stopwords.txt/>
</Stop>

DennisWa
Author
July 3, 2025

<WhitespaceTokenizerFactory />

This tokenizer is very basic: it only splits on whitespace and does not strip punctuation or apply word-boundary rules the way StandardTokenizerFactory does.
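
A rough plain-Python illustration of that difference follows. It is not Lucene code: `word_tokens` uses a simple regular expression as a stand-in for a word-boundary tokenizer, and the sample sentence and `STOPWORDS` set are made up for the example. It only explains leftovers where the stopword is directly followed by punctuation.

```python
import re

# Assumed lowercase entries, as in a typical stopwords.txt.
STOPWORDS = {"are", "was", "that"}


def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to the word."""
    return text.split()


def word_tokens(text):
    """Rough stand-in for a word-boundary tokenizer: runs of letters, digits, apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def drop_stopwords(tokens):
    """Case-insensitive stop filter over already-split tokens."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


text = "I think that, if they are, it was."

# "that,", "are," and "was." never equal a list entry, so nothing is removed.
kept_whitespace = drop_stopwords(whitespace_tokens(text))

# With punctuation split off, all three stopwords match and are removed.
kept_words = drop_stopwords(word_tokens(text))
```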

 

<tokenizer jcr:primaryType="nt:unstructured" name="Standard">
  <StandardTokenizerFactory jcr:primaryType="nt:unstructured"/>
</tokenizer>

 

Update the stop filter to:

<Stop jcr:primaryType="nt:unstructured" ignoreCase="{Boolean}true" words="stopwords.txt">
  <stopwords.txt/>
</Stop>

Thanks for the suggestions; unfortunately, no luck. By the way, the uppercase forms "THAT" and "ARE" are not indexed, so I really have no idea what is happening here.

kautuk_sahni
Community Manager
July 14, 2025

@denniswa Just checking in — were you able to resolve your issue?
We’d love to hear how things worked out. If the suggestions above helped, marking a response as correct can guide others with similar questions. And if you found another solution, feel free to share it — your insights could really benefit the community. Thanks again for being part of the conversation!

Kautuk Sahni
Level 2
September 9, 2025

Facing a similar issue. Any fixes?