Setting excerpt (hit.getExcerpt) to true for a huge volume of PDFs makes the search query really slow

Level 2

We have implemented a custom search using QueryBuilder for pages and assets. When I set p.excerpt = true, I get the excerpt information for both pages and assets (mainly PDFs), but the query gets very slow.

The query works as expected (faster) if I set the excerpt only for pages, but as soon as I use it for assets, it gets slow.

Is there any way to extract excerpts for a huge volume of assets faster through a QueryBuilder search?

 

(/jcr:root/content/xxx/us/en//element(*, cq:Page)[(jcr:contains(., 'xyz') and not(jcr:content/@isNotSearchable))] | /jcr:root/content/dam/xxx/documents//element(*, dam:Asset)[(jcr:contains(., 'xyz') )])/rep:excerpt(.)

 

9 Replies

Community Advisor

@abhishekk981269 A few questions to guide you better: what version of AEM are you using? What are you doing with the excerpts?

Level 2

Hi Shashi, it is AEM 6.5.4.

I am trying to highlight search results (full-text search).

This is how I am fetching it in code (the query is in the original post):

hit.getExcerpt()

The code returns the excerpt as expected, but the query gets very slow if I set p.excerpt = true for assets.

  • Our assets (PDFs) are around 4-5 GB

@Shashi_Mulugu 

Community Advisor
Can you please convert your query to QueryBuilder format and run it with the Explain Query tool to check its performance?
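
For reference, the XPath in the original post could be expressed as QueryBuilder predicates roughly as follows. This is a minimal sketch, not the poster's actual code: the class and method names are hypothetical, the paths and search term are the placeholders from the post, and it assumes the standard `fulltext`, `type`, `path`, `property` and `group` predicates. It only builds the predicate map, which in AEM would then be handed to `PredicateGroup.create(...)` and `QueryBuilder.createQuery(...)`; the XPath that QueryBuilder generates from it may differ from the hand-written one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds QueryBuilder predicates only, no AEM dependency.
final class ExcerptQuerySketch {

    /** Predicates for a full-text search over pages OR assets. */
    static Map<String, String> buildPredicates(String term, boolean withExcerpt) {
        Map<String, String> p = new LinkedHashMap<>();
        // Full-text term applied to every hit
        p.put("fulltext", term);
        // The two branches below are OR-ed together
        p.put("group.p.or", "true");
        // Branch 1: pages that are not flagged isNotSearchable
        p.put("group.1_group.type", "cq:Page");
        p.put("group.1_group.path", "/content/xxx/us/en");
        p.put("group.1_group.property", "jcr:content/isNotSearchable");
        p.put("group.1_group.property.operation", "not");
        // Branch 2: DAM assets
        p.put("group.2_group.type", "dam:Asset");
        p.put("group.2_group.path", "/content/dam/xxx/documents");
        // Toggle excerpt extraction to compare both variants
        p.put("p.excerpt", Boolean.toString(withExcerpt));
        return p;
    }

    public static void main(String[] args) {
        // Print in key=value form, one predicate per line
        buildPredicates("xyz", true)
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Calling `buildPredicates("xyz", false)` yields the same search without excerpts, which makes it easy to compare the two variants side by side.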

Level 2

Thanks @Shashi_Mulugu. Yeah I tried that.

 

Here is my query performance when I include p.excerpt = true:

Indexes Used
cqPageLucene(/oak:index/cqPageLucene)
damAssetLucene(/oak:index/damAssetLucene)
Execution Time
Total time: 5697 ms
  • Query execution time: 1 ms
  • Get nodes time: 42 ms
  • Result node count time: 5654 ms
  • Number of nodes in result: 3518

Here is my query performance when I do not include p.excerpt = true in the query:

Indexes Used
cqPageLucene(/oak:index/cqPageLucene)
damAssetLucene(/oak:index/damAssetLucene)
Execution Time
Total time: 59 ms
  • Query execution time: 0 ms
  • Get nodes time: 4 ms
  • Result node count time: 55 ms
  • Number of nodes in result: 3518

As you can see, the query response time drops from 5697 ms to 59 ms when p.excerpt = true is removed.

I could not find any helpful article or resource explaining how to tune a query with excerpts.

The rest of the query picks up the OOTB indexes, to which we have added our custom property indexes for isNotSearchable. There is nothing else in the query, and it is restricted to the searchable content.
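
A quick back-of-the-envelope calculation on the two Explain Query runs above shows where the time goes. The sketch below (class and method names are illustrative) uses only the figures reported in this post. It assumes the extra "Result node count time" is spread evenly across the 3518 result nodes, which is a simplification.

```java
// Illustrative arithmetic on the Explain Query figures quoted in this post.
final class ExcerptOverhead {

    /** How many times slower the run with excerpts is overall. */
    static double slowdownFactor(long withExcerptMs, long withoutExcerptMs) {
        return (double) withExcerptMs / withoutExcerptMs;
    }

    /** Extra result-iteration time per node attributable to excerpts. */
    static double extraMsPerNode(long countWithMs, long countWithoutMs, long nodes) {
        return (double) (countWithMs - countWithoutMs) / nodes;
    }

    public static void main(String[] args) {
        // Total time: 5697 ms with excerpts, 59 ms without
        System.out.printf("slowdown: %.1fx%n", slowdownFactor(5697, 59));
        // Result node count time: 5654 ms vs 55 ms over 3518 nodes
        System.out.printf("extra per node: %.2f ms%n",
                extraMsPerNode(5654, 55, 3518));
    }
}
```

With these figures the excerpt run is roughly 97 times slower overall, and the extra cost works out to about 1.6 ms per result node.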


Community Advisor
In that case, can you please cross-check whether those indexes have the "useInExcerpt" property enabled? https://jackrabbit.apache.org/oak/docs/query/lucene.html#Property_Definitions

Level 2

Thanks @Shashi_Mulugu. I want to use useInExcerpt, but all the DAM index definitions are for metadata properties. I am not sure whether setting useInExcerpt = true on metadata properties will help index the actual content of the PDFs for excerpts.
Is there a way to use useInExcerpt on the actual content of the PDFs, so that the PDF content is indexed rather than the metadata properties?

Here are the DAM indexes (all metadata):

[Image: abhishekk981269_0-1598288219033.png]
Let me know. I appreciate your help.

 

Community Advisor
Can you check whether you can add the original rendition to the full-text search index and take it from there? https://jackrabbit.apache.org/oak/docs/query/lucene.html... If not, I would recommend using Solr or another search engine and indexing the content there, since those offer OOTB full-text search with excerpts and optimal query performance.

Level 2
Thank you, Shashi. I will check if that works for me. Appreciate your help.