SOLVED

Lucene search for Chinese market

Our application provides search across multiple markets, covering both pages and DAM assets. Below is our basic query. Search works fine for most markets, but for the China market we are getting a lot of irrelevant results. I have not been able to find out how Lucene search works for non-English languages. Is any configuration required for this? Also, is there a specific Analyzer configuration for the Chinese language?

fulltext=insurance

1_group.p.or=true

1_group.1_group.path=/content/product/us

1_group.2_group.path=/content/dam/product/us

2_group.p.or=true

2_group.1_group.type=cq:Page

2_group.2_group.type=dam:Asset

p.excerpt=true

AEM version: 6.2.0, SP1 - CFP8

Oak version: Apache Jackrabbit Oak 1.4.17

Thanks,

Vazahat Fatima P

1 Accepted Solution

Correct answer by
Community Advisor

Dear Vazahat,

Normally, AEM indexes content with English in mind: Lucene's default configuration targets English, and the out-of-the-box indexes are set up to follow English semantics.

AEM/Oak/Lucene/Java does not do any magic: it only crunches your data into numbers (hashes; hello, inverted index), compares the numbers of matches, and shows them to you in a certain order. When you get irrelevant results, it means that your indexes contain irrelevant data. Therefore you need to correct:

a) How the data gets into your indexes

b) How you retrieve data from your indexes
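
To make (a) and (b) concrete, here is a toy inverted index in Python. It is only a sketch and not how Oak or Lucene are implemented: the documents, the function names, and the "match any token" rule are all assumptions chosen for illustration. It compares a tokenizer that emits one token per Han character with one that emits overlapping two-character tokens (the idea behind CJK bigram analysis).

```python
from collections import defaultdict


def unigrams(text):
    """One token per non-space character (single-character tokenization)."""
    return [ch for ch in text if not ch.isspace()]


def bigrams(text):
    """Overlapping two-character tokens (the idea behind CJK bigram analysis)."""
    chars = unigrams(text)
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs, tokenize):
    """(a) How data gets in: map each token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query, tokenize):
    """(b) How data comes out: tokenize the query the same way, match ANY token.

    Matching any token (OR) is an assumption made for this toy example.
    """
    hits = set()
    for token in tokenize(query):
        hits |= index.get(token, set())
    return hits


# Hypothetical sample documents, not taken from the original question.
DOCS = {
    "insurance-page": "保险产品",  # "insurance products"
    "risk-page": "风险管理",       # "risk management"
    "protect-page": "保护环境",    # "protect the environment"
}

if __name__ == "__main__":
    query = "保险"  # "insurance"
    uni_index = build_index(DOCS, unigrams)
    bi_index = build_index(DOCS, bigrams)
    print(sorted(search(uni_index, query, unigrams)))
    print(sorted(search(bi_index, query, bigrams)))
```

In this toy setup the single-character index returns all three pages for the query, because each page shares at least one character with it, while the bigram index returns only the insurance page. The point is that the tokenization applied at index time and at query time decides what counts as a match.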

It's fairly hard to get this 'right' just with plain Oak-Lucene integration.[0]

Please consider using the Oak Solr extension[1], which provides support for the Chinese language and a human-readable configuration format.

I can also recommend a recent book on relevancy by Doug Turnbull and John Berryman.[2]

[0] Issue with an oak index using synonym filter

[1] Language Analysis | Apache Solr Reference Guide 6.6

[2] Relevant Search: With applications for Solr and Elasticsearch: Doug Turnbull, John Berryman: 9781617...

Regards,

Peter
