Question

Stopwords Not Working as Expected in Custom Lucene Index (e.g. "are", "was", "that")

July 3, 2025
3 replies
763 views

Hi Everyone,

I'm currently adding a custom stopwords.txt file to my custom Lucene index in AEM to filter out common stopwords during search. While most of the words in my list are being excluded as expected, I've noticed that some very common ones like "are", "was", and "that" are still being indexed and returned in search results.

My stopwords.txt file includes all of these terms (one per line), and I’ve confirmed the file is correctly referenced in the analyzer configuration for the index.

I’m wondering if anyone else has experienced this issue? Is there anything I might be missing related to:

File encoding or formatting of the stopwords.txt file?
Analyzer or tokenizer order/configuration?
Case sensitivity issues even with ignoreCase = true?
Possible overrides by other filters or analyzers?

Any suggestions or shared experiences would be greatly appreciated!

Thanks in advance!

DennisWaAuthor

Only a few words are still being indexed; the rest work as expected. Here is my XML file:

Nilesh_Mali

Level 3

@denniswa

Verify your custom index's analyzers and filters in your oak:index definition.

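The StopFilterFactory should come after the LowerCaseFilterFactory so that tokens are already lowercased when they are compared with the stopword list. With ignoreCase="true" the comparison is case-insensitive anyway, so the order matters mainly when that attribute is false or absent.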

It should be something like this:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
</analyzer>

From the details you shared, it looks like you are missing the LowerCaseFilterFactory.

Make sure to trigger a reindex after any change.
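
For reference, the Solr-style snippet above might translate into an Oak index definition roughly as follows. This is only a sketch: the node names (analyzers, default, tokenizer, filters, LowerCase, Stop) are assumed from the naming convention described in the Oak Lucene documentation, the empty stopwords.txt element stands for the nt:file node that holds the word list, and the surrounding index definition is omitted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the layout is assumed from the Oak Lucene analyzer convention,
     not taken from the original poster's index definition. -->
<analyzers xmlns:jcr="http://www.jcp.org/jcr/1.0"
           jcr:primaryType="nt:unstructured">
    <default jcr:primaryType="nt:unstructured">
        <!-- Tokenizer is selected by name; "Standard" maps to StandardTokenizerFactory -->
        <tokenizer jcr:primaryType="nt:unstructured" name="Standard"/>
        <!-- Filters are applied in node order, so LowerCase runs before Stop -->
        <filters jcr:primaryType="nt:unstructured">
            <LowerCase jcr:primaryType="nt:unstructured"/>
            <Stop
                jcr:primaryType="nt:unstructured"
                ignoreCase="{Boolean}true"
                words="stopwords.txt">
                <!-- Placeholder for the nt:file node that stores the stopword list -->
                <stopwords.txt/>
            </Stop>
        </filters>
    </default>
</analyzers>
```

Because the filters run in node order, any additional filter placed before Stop (a stemmer, for example) would change the tokens that the stopword list is compared against, so it is worth checking what sits ahead of Stop in the actual definition.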

DennisWaAuthor

Hi Nilesh,

Thanks for replying. I just tried adding ignoreCase = true, but the words are still being indexed.

<Stop
    jcr:primaryType="nt:unstructured"
    ignoreCase="{Boolean}true"
    words="stopwords.txt">
    <stopwords.txt/>
</Stop>

DennisWaAuthor

This tokenizer is very basic: it only splits on whitespace and does not normalize punctuation, possessives, or word boundaries the way StandardTokenizerFactory does.

<tokenizer jcr:primaryType="nt:unstructured" name="Standard">
  <StandardTokenizerFactory jcr:primaryType="nt:unstructured"/>
</tokenizer>

Update the stop filter to:

<Stop
    jcr:primaryType="nt:unstructured"
    ignoreCase="{Boolean}true"
    words="stopwords.txt">
    <stopwords.txt/>
</Stop>

Thanks for the updates, but unfortunately no luck. By the way, the uppercase forms "THAT" and "ARE" are not indexed, so I really have no idea what is happening here.

kautuk_sahni

Community Manager

@denniswa Just checking in — were you able to resolve your issue?
We’d love to hear how things worked out. If the suggestions above helped, marking a response as correct can guide others with similar questions. And if you found another solution, feel free to share it — your insights could really benefit the community. Thanks again for being part of the conversation!

Kautuk Sahni

PoorvaJa

Level 2

Facing a similar issue. Any fixes?
