Stopwords Not Working as Expected in Custom Lucene Index (e.g. "are", "was", "that")

Level 1

Hi Everyone,

I'm currently adding a custom stopwords.txt file to my custom Lucene index in AEM to filter out common stopwords during search. While most of the words in my list are being excluded as expected, I've noticed that some very common ones like "are", "was", and "that" are still being indexed and returned in search results.

My stopwords.txt file includes all of these terms (one per line), and I’ve confirmed the file is correctly referenced in the analyzer configuration for the index.

I’m wondering if anyone else has experienced this issue? Is there anything I might be missing related to:

  • File encoding or formatting of the stopwords.txt file?

  • Analyzer or tokenizer order/configuration?

  • Case sensitivity issues even with ignoreCase = true?

  • Possible overrides by other filters or analyzers?

Any suggestions or shared experiences would be greatly appreciated!

Thanks in advance!

6 Replies

Level 1

Only a few words are still being indexed; the rest are filtered out as expected. Here is my XML file:

Screenshot 2025-07-03 at 2.45.13 pm.png

 

Level 4

@DennisWa 

 

Verify your custom index's analyzers and filters in your oak:index definition.

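The StopFilterFactory should be placed after any lowercasing filter, so that tokens are already normalized when they are compared against stopwords.txt. Setting ignoreCase="true" additionally makes that comparison case-insensitive.

It should look something like this: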

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
</analyzer>

 

From the details you shared, it looks like LowerCaseFilterFactory is missing from your filter chain.
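
For reference, Oak expresses this chain as nodes under the index definition rather than as Solr <analyzer> markup. The following is only a minimal sketch, assuming the composite analyzer layout described in the Oak Lucene documentation (analyzers/default with a tokenizer node and an ordered filters node) and assuming stopwords.txt is stored as an nt:file child of the Stop node. The file's content is not shown, and your actual definition may differ:

```xml
<!-- Fragment only: this node is assumed to sit directly under the index definition. -->
<analyzers
    xmlns:jcr="http://www.jcp.org/jcr/1.0"
    jcr:primaryType="nt:unstructured">
    <default jcr:primaryType="nt:unstructured">
        <!-- "Standard" resolves to StandardTokenizerFactory -->
        <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
        <!-- Child order matters: LowerCase runs before Stop -->
        <filters jcr:primaryType="nt:unstructured">
            <LowerCase jcr:primaryType="nt:unstructured"/>
            <Stop
                jcr:primaryType="nt:unstructured"
                ignoreCase="{Boolean}true"
                words="stopwords.txt">
                <!-- nt:file child holding the stopword list -->
                <stopwords.txt/>
            </Stop>
        </filters>
    </default>
</analyzers>
```

The filter node names (LowerCase, Stop) are the factory class names without the FilterFactory suffix, which is how Oak looks them up.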

Make sure to trigger a reindex after making changes.
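
One common way to request a reindex is to set the reindex property on the index definition node. This is a sketch only: the node name myCustomIndex is hypothetical, and all other properties and child nodes of the real definition are omitted here, so do not deploy it as-is:

```xml
<!-- Hypothetical node name; only the reindex property is the point of this fragment. -->
<myCustomIndex
    xmlns:jcr="http://www.jcp.org/jcr/1.0"
    xmlns:oak="http://jackrabbit.apache.org/oak/ns/1.0"
    jcr:primaryType="oak:QueryIndexDefinition"
    type="lucene"
    reindex="{Boolean}true"/>
```

Oak sets reindex back to false and increments reindexCount once the reindex has completed, which is a quick way to confirm it actually ran.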

 

Level 1

Hi Nilesh,

 

Thanks for replying. I just tried adding ignoreCase = true, but those words are still being indexed.

 
<Stop
    jcr:primaryType="nt:unstructured"
    ignoreCase="{Boolean}true"
    words="stopwords.txt">
    <stopwords.txt/>
</Stop>

Level 4

<WhitespaceTokenizerFactory />

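This tokenizer is very basic: it only splits on whitespace and does not handle punctuation or word boundaries the way StandardTokenizerFactory does. A token such as "that," keeps its trailing comma and therefore will not match the stopword entry "that". Try switching to the Standard tokenizer: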

 

<tokenizer jcr:primaryType="nt:unstructured" name="Standard">
  <StandardTokenizerFactory jcr:primaryType="nt:unstructured"/>
</tokenizer>

 

Update the stop filter to:

<Stop
    jcr:primaryType="nt:unstructured"
    ignoreCase="{Boolean}true"
    words="stopwords.txt">
    <stopwords.txt/>
</Stop>

Level 1

Thanks for the updates, but unfortunately no luck. Interestingly, the uppercase forms "THAT" and "ARE" are not indexed, only the lowercase ones are. So I really have no idea what is happening here.

Administrator

@DennisWa Just checking in — were you able to resolve your issue?
We’d love to hear how things worked out. If the suggestions above helped, marking a response as correct can guide others with similar questions. And if you found another solution, feel free to share it — your insights could really benefit the community. Thanks again for being part of the conversation!



Kautuk Sahni