AEM fulltext search result order

codingStar

16-10-2019

I am working on the following scenario:

  1. We have a few PDF files in DAM.
  2. Search for any keyword, and if that keyword is found in any of the PDFs, show those PDFs in the result list.

I was able to achieve the above using fulltext search in DAM. Below is the query:

SELECT * FROM [dam:Asset] AS a WHERE CONTAINS(a.*, '" + searchKeyword+ "') AND [jcr:path] like '/content/dam/mywebsitefolder/%'
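
The query above is assembled by concatenating searchKeyword into a Java string, so a keyword containing a single quote would break the statement. A minimal sketch of one way to guard against that, assuming the query is built in plain Java: the class name AssetQuerySketch and the methods escapeLiteral and buildQuery are hypothetical, and only single quotes are handled here (fulltext operators inside the keyword are not).

```java
final class AssetQuerySketch {

    // Doubles single quotes so the keyword cannot end the SQL2 string literal early.
    // Assumption: this is the only escaping needed for this illustration.
    static String escapeLiteral(String keyword) {
        return keyword.replace("'", "''");
    }

    // Builds the same JCR-SQL2 statement as in the post, for a given DAM folder.
    static String buildQuery(String searchKeyword, String folder) {
        return "SELECT * FROM [dam:Asset] AS a "
             + "WHERE CONTAINS(a.*, '" + escapeLiteral(searchKeyword) + "') "
             + "AND [jcr:path] LIKE '" + folder + "/%'";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("world", "/content/dam/mywebsitefolder"));
    }
}
```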

The next requirement is:

  3. Sort the result list by the number of occurrences of "searchKeyword" found in the PDFs.

For example, I have 3 PDFs in DAM named mypdf-1.pdf, mypdf-2.pdf, and mypdf-3.pdf:

PDF Name       PDF Content text
mypdf-1.pdf    world
mypdf-2.pdf    world world world
mypdf-3.pdf    world world

If I search for "world", the result order should be:

/content/dam/mywebsitefolder/mypdf-2.pdf

/content/dam/mywebsitefolder/mypdf-3.pdf

/content/dam/mywebsitefolder/mypdf-1.pdf

Can you please share how I should write the query to get the results in the order shown above?

Accepted Solutions (1)

sunjot16

Employee

22-10-2019

You can add a Boost in your index rule as follows:

Jackrabbit Oak – Lucene Index

How about adding Search Boost to the AEM Asset too?:

Search Boost

Answers (6)

-ash

Employee

20-10-2019

Hi,

The default ordering is by relevance, so you don't have to do anything explicitly.

But "relevance" is calculated in a more elaborate way than just counting the word frequency in documents.

The document "TFIDFSimilarity (Lucene 7.6.0 API)" might give you a glimpse of what is happening behind the scenes. There is also a Wikipedia article that explains the very basics: "tf–idf - Wikipedia".

What you have experienced in your test case might be the normalization: relevance is not measured by raw term frequency but by term frequency divided by document length, to give shorter documents a chance to be relevant.

That means your three test PDFs have normalized frequencies of 1/1, 2/2 and 3/3, which all equal 1, and thus the order seems random.

If you want to validate the query, I propose you test with real-world examples.
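
The normalization argument above can be illustrated with a small sketch. This is a deliberately simplified model (term count divided by word count), not Lucene's actual scoring, which the linked TFIDFSimilarity documentation describes in full. The class name NormalizedFrequencySketch and its method are hypothetical.

```java
import java.util.Locale;

final class NormalizedFrequencySketch {

    // Simplified model from the post: occurrences of the term divided by
    // the number of whitespace-separated words in the document.
    static double normalizedFrequency(String text, String term) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0; // an empty document has no matches
        }
        String wanted = term.toLowerCase(Locale.ROOT);
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        int matches = 0;
        for (String token : tokens) {
            if (token.equals(wanted)) {
                matches++;
            }
        }
        return (double) matches / tokens.length;
    }

    public static void main(String[] args) {
        // The three test PDFs from the question all come out equal.
        System.out.println(normalizedFrequency("world", "world"));             // 1.0
        System.out.println(normalizedFrequency("world world world", "world")); // 1.0
        System.out.println(normalizedFrequency("world world", "world"));       // 1.0
    }
}
```

With documents that contain other words as well, for example "hello world", the value drops to 0.5, which is why real-world content produces a more meaningful order.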

-ash

Employee

21-10-2019

exactly 🙂

codingStar

21-10-2019

-ash, do you mean that I don't need to add any extra parameter to my query (shown below) to get results in relevance order from DAM, whether the file is a .docx or a .pdf?

SELECT * FROM [dam:Asset] AS a WHERE CONTAINS(a.*, '" + searchKeyword+ "') AND [jcr:path] like '/content/dam/mywebsitefolder/%'

Bharath_valse

18-10-2019

This one's a tricky requirement. I believe it can be achieved via a custom predicate [0], where the sorting happens based on the number of occurrences (count) of a search term. Here's another forum thread [1], somewhat similar but about pages, where the requirement was to find a search term that occurs a minimum of two times.

Another thought on the requirement itself: relevance is hard to derive from a single search term. However, you could try using boosts [2] for the index, similar to the example below. Hope this helps!

jcr:contains(., 'jelly sandwich^4') 
In this example, the word "sandwich" carries four times the weight of the word "jelly".

[0] Implementing a Custom Predicate Evaluator for the Query Builder

[1] How to use QueryBuilder API to search a keyword a minimum of 2 times in the Page content.

[2] Use Boosts | Indexing time and query runtime
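
The count-based sorting suggested above can be sketched in plain Java. This is only an illustration of the ordering step: it assumes the extracted text of each asset is already available as a string keyed by path, and it does not show how a custom predicate evaluator would obtain that text in AEM. The class name OccurrenceSortSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class OccurrenceSortSketch {

    // Counts non-overlapping, case-insensitive substring matches of the keyword.
    // Note: this is plain substring matching, with no tokenization or stemming.
    static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    // Returns the asset paths ordered by descending keyword count.
    static List<String> sortByOccurrences(Map<String, String> textByPath, String keyword) {
        List<String> paths = new ArrayList<>(textByPath.keySet());
        paths.sort(Comparator
                .comparingInt((String path) -> countOccurrences(textByPath.get(path), keyword))
                .reversed());
        return paths;
    }
}
```

For the three PDFs in the question, sortByOccurrences returns mypdf-2.pdf, then mypdf-3.pdf, then mypdf-1.pdf. Sorting in application code like this means every matching asset has to be loaded first, so it is likely only practical for small result sets.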

codingStar

18-10-2019

This is not working.

Let me rephrase my question.

I want to show the most relevant file at the top, then the next most relevant, and so on.

Suppose the PDFs contain thousands of words and only a few of them match the keyword 'world'. I want the list ordered so that the first file is the one with the most matching words.

jbrar

Employee

16-10-2019

I believe the query is using the damAssetLucene index. You can add the ordered=true property to make the indexed property ordered. From the Oak documentation [1]:

ordered: If the property is to be used in an order by clause to perform sorting, then this should be set to true. This should be set to true only if the property is to be used to perform sorting, as it increases the index size. Example:

  • //element(*, app:Asset)[jcr:contains(type, 'image')] order by @size
  • //element(*, app:Asset)[jcr:contains(type, 'image')] order by jcr:content/@jcr:lastModified

[1] Jackrabbit Oak – Lucene Index