I am working on the scenario below:
Search for any keyword, and if that keyword is found in any PDF, show that PDF in the result list.
I am able to achieve this using full-text search in DAM. Below is the query:
SELECT * FROM [dam:Asset] AS a WHERE CONTAINS(a.*, '" + searchKeyword+ "') AND [jcr:path] like '/content/dam/mywebsitefolder/%'
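The query above concatenates the raw keyword into the statement, so a keyword containing a single quote would break it. A minimal sketch of a safer builder follows, assuming the statement is assembled as a Java string as in the snippet above. The class and method names are hypothetical, and the standard JCR-SQL2 function ISDESCENDANTNODE is swapped in for the LIKE on [jcr:path] as an assumed equivalent path restriction. The full-text expression has its own syntax (quotes, minus, OR) that this sketch does not handle.

```java
class DamQuerySketch {
    // Escapes a value for use inside a single-quoted JCR-SQL2 string literal
    // by doubling every single quote.
    public static String escapeLiteral(String value) {
        return value.replace("'", "''");
    }

    // Builds the full-text query from the question for a keyword and a DAM folder.
    // ISDESCENDANTNODE replaces the original "[jcr:path] like '<folder>/%'" (assumption).
    public static String buildQuery(String searchKeyword, String folder) {
        return "SELECT * FROM [dam:Asset] AS a "
                + "WHERE CONTAINS(a.*, '" + escapeLiteral(searchKeyword) + "') "
                + "AND ISDESCENDANTNODE(a, '" + escapeLiteral(folder) + "')";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("world", "/content/dam/mywebsitefolder"));
    }
}
```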
The next requirement is:
Sort the result list by the number of occurrences of "searchKeyword" found in each PDF.
For example, I have three PDFs in DAM named mypdf-1.pdf, mypdf-2.pdf, and mypdf-3.pdf:
PDF Name | PDF Content text |
---|---|
mypdf-1.pdf | world |
mypdf-2.pdf | world world world |
mypdf-3.pdf | world world |
If I search for "world", the result order should be:
/content/dam/mywebsitefolder/mypdf-2.pdf
/content/dam/mywebsitefolder/mypdf-3.pdf
/content/dam/mywebsitefolder/mypdf-1.pdf
Can you please share how I should write the query to get the results in the order shown above?
You can add a Boost in your index rule as follows:
How about adding Search Boost to the AEM Asset too?:
I believe the query is using the damAssetLucene index. You can add the ordered=true property to the relevant property definition so that the index supports ordering. From the documentation:
ordered: If the property is to be used in an order by clause to perform sorting, then this should be set to true. This should be set to true only if the property is to be used to perform sorting, as it increases the index size.
This is not working.
Let me rephrase my question.
I want to show the most relevant file at the top, then the next most relevant, and so on.
Suppose the PDFs contain thousands of words and only a few of them match the keyword 'world'. I want the list ordered so that the first file in it is the one with the most matching words.
This one's a tricky requirement. I believe it can be achieved via a custom predicate [0], where the sorting is based on the number of occurrences (count) of a search term. Here's another forum thread [1] that is somewhat similar, but with pages, where the requirement was to find a search term that occurs at least twice.
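As a rough illustration of count-based sorting, here is a minimal sketch that orders asset paths by how often a term occurs in their text. It assumes the extracted text of each PDF is already available as a string (how to obtain it in AEM is outside this sketch), and it is plain Java rather than an actual predicate evaluator. The class and method names are hypothetical, and the term is assumed to start and end with word characters.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OccurrenceSortSketch {
    // Counts whole-word, case-insensitive occurrences of term in text.
    public static int countOccurrences(String text, String term) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Returns the paths sorted by descending occurrence count.
    // List.sort is stable, so ties keep their original order.
    public static List<String> sortByOccurrences(Map<String, String> textByPath, String term) {
        List<String> paths = new ArrayList<>(textByPath.keySet());
        paths.sort(Comparator.comparingInt(
                (String path) -> countOccurrences(textByPath.get(path), term)).reversed());
        return paths;
    }

    public static void main(String[] args) {
        // Sample data taken from the table in the question.
        Map<String, String> textByPath = new LinkedHashMap<>();
        textByPath.put("/content/dam/mywebsitefolder/mypdf-1.pdf", "world");
        textByPath.put("/content/dam/mywebsitefolder/mypdf-2.pdf", "world world world");
        textByPath.put("/content/dam/mywebsitefolder/mypdf-3.pdf", "world world");
        // Prints mypdf-2.pdf, then mypdf-3.pdf, then mypdf-1.pdf.
        sortByOccurrences(textByPath, "world").forEach(System.out::println);
    }
}
```

Counting in application code like this means reading the text of every hit, so it is only practical for small result sets.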
Another thought on the requirement itself: relevance is hard to derive from a single search term. However, you could try using boosts [2] for the index, similar to the example below. Hope this helps!
jcr:contains(., 'jelly sandwich^4')
In this example, the word "sandwich" has weight four times more than the word "jelly."
[0] Implementing a Custom Predicate Evaluator for the Query Builder
[1] How to use QueryBuilder API to search a keyword a minimum of 2 times in the Page content.
[2]
Hi,
the default ordering is by relevance, so you don't have to do anything explicitly.
But "relevance" is calculated in a more elaborate way than just counting the word frequency in documents.
The document TFIDFSimilarity (Lucene 7.6.0 API) might give you a glimpse of what is happening behind the scenes. There is also a Wikipedia article that explains the very basics.
What you have experienced in your test case might be the normalization: relevance is not determined by raw term frequency but by term frequency divided by document length, to give shorter documents a chance to be relevant.
That means your three test documents have normalized frequencies of 1/1, 2/2 and 3/3, which all equal 1, and thus the order seems random.
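To make the normalization effect concrete, here is a minimal sketch using the term-frequency and length-norm factors of Lucene's classic TF-IDF similarity (tf = sqrt(freq), lengthNorm = 1/sqrt(numTerms)). It is a simplification: the real score also includes inverse document frequency, boosts, and a lossy encoding of the norm, and which similarity the AEM index actually uses is an assumption here. The class and method names are hypothetical.

```java
class RelevanceNormalizationSketch {
    // Simplified per-document score: sqrt(termFrequency) * 1/sqrt(documentLength).
    // documentLength is the number of terms in the document.
    public static double score(int termFrequency, int documentLength) {
        double tf = Math.sqrt(termFrequency);
        double lengthNorm = 1.0 / Math.sqrt(documentLength);
        return tf * lengthNorm;
    }

    public static void main(String[] args) {
        // The three test PDFs consist only of the keyword, so frequency equals length
        // and the scores come out (up to rounding) identical.
        System.out.println("mypdf-1.pdf: " + score(1, 1));
        System.out.println("mypdf-2.pdf: " + score(3, 3));
        System.out.println("mypdf-3.pdf: " + score(2, 2));

        // With longer, realistic documents of equal length, more occurrences score higher.
        System.out.println("3 hits in 1000 terms: " + score(3, 1000));
        System.out.println("1 hit in 1000 terms:  " + score(1, 1000));
    }
}
```

Under these assumptions the three single-word test documents are indistinguishable to the scorer, which is why testing with real-world documents gives a more meaningful ordering.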
If you want to validate the query, I propose you test with real-world examples.
-ash, you mean that I don't need to add any extra parameter to my query (shown below) to get results in relevance order from DAM, whether the file is a .docx or a .pdf?
SELECT * FROM [dam:Asset] AS a WHERE CONTAINS(a.*, '" + searchKeyword+ "') AND [jcr:path] like '/content/dam/mywebsitefolder/%'
exactly :-)