AEM fulltext search result order

codingStar 16-10-2019

I am working on the following scenario:

  1. We have a few PDF files in the DAM.
  2. Search for any keyword; if the keyword is found in any PDF, show that PDF in the result list.

I was able to achieve this with a fulltext search in the DAM. Below is the query:

SELECT * FROM [dam:Asset] AS a WHERE CONTAINS(a.*, '" + searchKeyword+ "') AND [jcr:path] like '/content/dam/mywebsitefolder/%'

Now the next requirement is:

  3. Sort the result list by the number of occurrences of "searchKeyword" found in each PDF.

For example: I have 3 PDFs in the DAM named mypdf-1.pdf, mypdf-2.pdf and mypdf-3.pdf.

PDF Name      PDF Content text
mypdf-1.pdf   world
mypdf-2.pdf   world world world
mypdf-3.pdf   world world

If I search for "world", the result order should be:

/content/dam/mywebsitefolder/mypdf-2.pdf

/content/dam/mywebsitefolder/mypdf-3.pdf

/content/dam/mywebsitefolder/mypdf-1.pdf

Can you please share how I should write the query to get the results in the order shown above?

Answers (6)

-ash
Employee
20-10-2019

Hi,

The default ordering is by relevance. You don't have to do anything explicitly.

But "relevance" is calculated in a more elaborate way than just counting the word frequency in documents.

The document "TFIDFSimilarity (Lucene 7.6.0 API)" might give you a glimpse of what is happening behind the scenes. There is also a Wikipedia article that explains the very basics: "tf–idf - Wikipedia".

What you have experienced in your test case might be the normalization: relevance is not computed from the raw term frequency but from the term frequency divided by the document length, to give shorter documents a chance to be relevant.

That means you have normalized frequencies of 1/1, 3/3 and 2/2 (for mypdf-1, mypdf-2 and mypdf-3), which all equal 1, and thus the order seems random.
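To make the arithmetic concrete, here is a minimal Java sketch of the simplified "term frequency divided by document length" idea described above. It is an illustration only and not Lucene's actual formula, which also involves idf and other factors. The class and method names are hypothetical, and tokenizing on whitespace is an assumption.

```java
import java.util.Arrays;

// Simplified illustration of length-normalized term frequency.
// This is NOT Lucene's real scoring formula; names are hypothetical.
class NormalizedFrequencyDemo {

    // Counts whitespace-separated tokens equal to the term (case-insensitive).
    static int termFrequency(String text, String term) {
        return (int) Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> token.equalsIgnoreCase(term))
                .count();
    }

    // Term frequency divided by the document length in tokens.
    static double normalizedFrequency(String text, String term) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        int length = trimmed.split("\\s+").length;
        return (double) termFrequency(trimmed, term) / length;
    }

    public static void main(String[] args) {
        // Contents of mypdf-1, mypdf-2 and mypdf-3 from the question.
        String[] docs = {"world", "world world world", "world world"};
        for (String doc : docs) {
            System.out.println(termFrequency(doc, "world") + " occurrence(s), normalized "
                    + normalizedFrequency(doc, "world"));
        }
    }
}
```

All three test documents end up with a normalized value of 1.0, so nothing distinguishes them. A document such as "hello big world" would get 1/3, which is why testing with real-world content gives a more meaningful order.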

If you want to validate the query, I propose you test with real-world examples.

codingStar 21-10-2019

-ash Do you mean that I don't need to add any extra parameter to my query (below) to get results in relevance order from the DAM, whether the file is a .docx or a .pdf?

SELECT * FROM [dam:Asset] AS a WHERE CONTAINS(a.*, '" + searchKeyword+ "') AND [jcr:path] like '/content/dam/mywebsitefolder/%'
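For reference, the relevance ordering can also be spelled out in the query itself. The following Java sketch only builds the query string. It assumes that the index in use supports ordering on [jcr:score], which should be verified on the target instance. The class and method names are hypothetical, and doubling single quotes is added here only so that the keyword cannot break out of the string literal.

```java
// Sketch: builds the JCR-SQL2 statement from the question with an explicit
// relevance ordering. Assumes the index supports ORDER BY [jcr:score].
class ScoreOrderedQuery {

    // Doubles single quotes so the keyword stays inside the SQL2 string literal.
    static String escapeLiteral(String value) {
        return value.replace("'", "''");
    }

    // Same conditions as the original query, plus ORDER BY [jcr:score] DESC.
    static String build(String searchKeyword) {
        return "SELECT * FROM [dam:Asset] AS a"
                + " WHERE CONTAINS(a.*, '" + escapeLiteral(searchKeyword) + "')"
                + " AND [jcr:path] LIKE '/content/dam/mywebsitefolder/%'"
                + " ORDER BY [jcr:score] DESC";
    }

    public static void main(String[] args) {
        System.out.println(build("world"));
    }
}
```

Since relevance is described above as the default order for fulltext queries, the explicit ORDER BY mainly documents the intent. It does not change how the score itself is calculated.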

Bharath_valse 18-10-2019

This one's a tricky requirement. I believe it can be achieved via a custom predicate [0], where the sorting is based on the number of occurrences (count) of the search term. Here's another forum thread [1] that is somewhat similar, but for pages, where the requirement was to find a search term occurring at least twice.
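As a rough illustration of sorting by occurrence count, here is a minimal Java sketch that re-orders result paths in application code. It is not a custom predicate evaluator. It assumes the extracted text of each asset is already available as a string, and how that text is obtained is outside this sketch. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch: orders asset paths by how often a keyword occurs in their text.
// Assumes the text per path was extracted elsewhere; names are hypothetical.
class OccurrenceSorter {

    // Counts non-overlapping, case-insensitive substring matches.
    // Note: "worldwide" also counts as a match for "world".
    static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index >= 0) {
            count++;
            index = haystack.indexOf(needle, index + needle.length());
        }
        return count;
    }

    // Returns the paths sorted by descending occurrence count (stable for ties).
    static List<String> sortByOccurrences(Map<String, String> textByPath, String keyword) {
        List<String> paths = new ArrayList<>(textByPath.keySet());
        paths.sort(Comparator.comparingInt(
                (String path) -> countOccurrences(textByPath.get(path), keyword)).reversed());
        return paths;
    }

    public static void main(String[] args) {
        Map<String, String> textByPath = new LinkedHashMap<>();
        textByPath.put("/content/dam/mywebsitefolder/mypdf-1.pdf", "world");
        textByPath.put("/content/dam/mywebsitefolder/mypdf-2.pdf", "world world world");
        textByPath.put("/content/dam/mywebsitefolder/mypdf-3.pdf", "world world");
        // Prints the paths in the order mypdf-2, mypdf-3, mypdf-1.
        System.out.println(sortByOccurrences(textByPath, "world"));
    }
}
```

Counting in application code means reading the text of every hit, so this is likely only practical for small result sets.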

Another thought on the requirement itself: relevance is hard to derive from a single search term. However, you could try using boosts [2] for the index, similar to the example below. Hope this helps!

jcr:contains(., 'jelly sandwich^4') 
In this example, the word "sandwich" carries four times the weight of the word "jelly".

[0] Implementing a Custom Predicate Evaluator for the Query Builder

[1] How to use QueryBuilder API to search a keyword a minimum of 2 times in the Page content.

[2] Use Boosts | Indexing time and query runtime

codingStar 18-10-2019

This is not working.

Let me rephrase my question.

I want to show the most relevant file on top, then the next most relevant, and so on.

Suppose the PDFs contain thousands of words and only a few of them match the keyword 'world'. I want to show the list in an order such that the first file is the one with the most matching words.

jbrar
Employee
16-10-2019

I believe the query is using the damAssetLucene index. You can add the property ordered=true to a property definition in that index so that the property can be used for sorting. From the documentation [1]:

ordered

If the property is to be used in an order by clause to perform sorting, then this should be set to true. This should be set to true only if the property is to be used to perform sorting, as it increases the index size. Example:

  • //element(*, app:Asset)[jcr:contains(type, 'image')] order by @size
  • //element(*, app:Asset)[jcr:contains(type, 'image')] order by jcr:content/@jcr:lastModified

[1] Jackrabbit Oak – Lucene Index