Does Asset Share Commons search allow searching for text within a document (PDF/Word/PPT)?

Employee

I searched for the term "contrary" in the search bar of the reference implementation and got only a PDF in the results. I expected both the PDF and the Word document to appear, since both contain the same text. I am not sure whether ASC supports searching on the text inside a document. Can someone confirm whether full-text content search within documents is supported in ASC?

https://aem.enablementadobe.com/content/asset-share-commons/en/light.html?fulltext=contrary&1_group....

 

6 Replies

Administrator

@h_kataria @TarunKumar @MukeshYadav_ Could you kindly take a look at this question and provide your thoughts? Your insights would be greatly appreciated.



Kautuk Sahni

Community Advisor

@digarg17 By default, full-text search is enabled for PDFs in DAM. For Word documents, you need to customize the search behavior as follows:

1. Ensure Full-Text Indexing for DOCX Files

AEM’s Oak indexing system, using Apache Tika, can extract the text content from Word documents. You need to ensure that this is correctly configured.

  • Check Tika for DOCX Support: AEM uses Tika to extract text from DOCX files. To verify that Tika is correctly configured:
    1. Go to CRX/DE Lite (/crx/de).
    2. Navigate to the DAM Lucene Index: /oak:index/damAssetLucene (or your custom asset index if you have one).
    3. Ensure that the index rules for DAM assets include DOCX file types. Check whether the file type application/vnd.openxmlformats-officedocument.wordprocessingml.document is being indexed.

Also make sure the indexing rule covers the content of the Word documents by indexing properties such as jcr:content/metadata.

2. Configure Asset Share Commons for DOCX Full-Text Search

Once DOCX files are being indexed, configure Asset Share Commons to allow searching within DOCX content.

  1. Modify Search Facets in Asset Share Commons:

    • Add a Full-Text Predicate to the search query, which searches through the text content extracted from the DOCX file.
    • You can include the fulltext property in your search predicates as shown in the following configuration:

    Example full-text predicate configuration (relPath points to where the DOCX text content is extracted):

    json

    {
      "predicates": [
        {
          "type": "fulltext",
          "path": "/content/dam",
          "relPath": "jcr:content/metadata",
          "property": "fulltext",
          "operation": "CONTAINS"
        }
      ]
    }
  2. Update Oak Index for DOCX Files: If you are using a customized Oak index (e.g., a customized version of damAssetLucene), make sure its indexing rules include full-text fields for DOCX files.

    In CRX/DE:

    • Navigate to /oak:index/damAssetLucene.
    • Ensure that the indexRules/dam:Asset/properties include the relevant fields for DOCX files, like jcr:content/metadata.
    • This will allow the text from DOCX files to be searchable.
  3. Modify Search Bar to Use Full-Text Predicate: Ensure that the search bar or component on your Asset Share Commons page is configured to use the fulltext predicate. This will allow users to enter search terms that will match content inside DOCX files.
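To check the full-text predicate outside of the ASC components, the same search can be sent to the AEM QueryBuilder JSON servlet. The following is a minimal Python sketch, not an ASC configuration: the helper names, the host, and the default limit are assumptions made for illustration, and it assumes the /bin/querybuilder.json endpoint is reachable on your instance.

```python
from urllib.parse import urlencode


def fulltext_predicates(term, path="/content/dam", limit=20):
    """Build QueryBuilder predicates for a full-text search over DAM assets.

    The helper name and defaults are hypothetical; the predicate names
    (path, type, fulltext, p.limit) are standard QueryBuilder predicates.
    """
    return {
        "path": path,
        "type": "dam:Asset",
        "fulltext": term,
        "p.limit": str(limit),
    }


def querybuilder_url(host, predicates):
    """Turn the predicates into a URL for the QueryBuilder JSON servlet."""
    return f"{host}/bin/querybuilder.json?{urlencode(predicates)}"


if __name__ == "__main__":
    # Host is an assumption: a local publish instance on the default port.
    print(querybuilder_url("http://localhost:4503", fulltext_predicates("contrary")))
```

Running the same URL against author and publish makes it easier to see whether the Word document is missing from the index or only from the ASC result list.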

3. Example Full-Text Search Query for DOCX in AEM

A JCR SQL2 query that searches for text within DOCX documents might look like this:

sql
SELECT *
FROM [dam:Asset] AS asset
WHERE CONTAINS(asset.[jcr:content/metadata], 'searchTerm')
  AND ISDESCENDANTNODE(asset, '/content/dam')
  AND asset.[jcr:content/metadata/dc:format] = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

This query searches for the searchTerm within the metadata of DOCX files (dc:format for DOCX is specified as 'application/vnd.openxmlformats-officedocument.wordprocessingml.document') in the /content/dam folder.
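If you want to try this query with different search terms, a small helper can assemble the statement for pasting into a query tool. This is only a sketch in Python: the function name is hypothetical, it reproduces the query from this reply as-is, and it assumes that doubling single quotes is sufficient escaping for the search term.

```python
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_docx_fulltext_query(term, root="/content/dam"):
    """Return the JCR-SQL2 statement from this reply for a given search term.

    Single quotes in the term are doubled so they cannot end the string
    literal early (assumption: no other escaping is needed for your terms).
    """
    safe_term = term.replace("'", "''")
    return (
        "SELECT * FROM [dam:Asset] AS asset "
        f"WHERE CONTAINS(asset.[jcr:content/metadata], '{safe_term}') "
        f"AND ISDESCENDANTNODE(asset, '{root}') "
        f"AND asset.[jcr:content/metadata/dc:format] = '{DOCX_MIME}'"
    )


if __name__ == "__main__":
    print(build_docx_fulltext_query("contrary"))
```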

4. Testing the Implementation

Once the full-text indexing and Asset Share Commons configuration are in place:

  • Upload a Word document (DOCX) into AEM DAM.
  • Wait for AEM to index the document and extract the text content using Tika.
  • Test the search functionality in Asset Share Commons by entering a term that exists within the DOCX document to see if it returns the appropriate results.

Please let me know how it goes.

 

Employee

On author, it returns both results (DOC and PDF) without any changes, on a new cloud prod instance. The issue occurs only on publish with ASC.

 

Community Advisor

Hi @digarg17,

If the query works on author but not on the publish instance, there may be an indexing issue.
You can try the following steps:

1. Open the Query Performance console on author at <host>:<port>/libs/granite/operations/content/diagnosistools/queryPerformance.html. In the Explain Query tab, test the query and check which index it uses.

2. Repeat step 1 on your publish instance and compare the result with author.

3. Also run your query in the query debugger tool on publish and check the results.

4. Open the index used on both author and publish and compare the node properties.

5. If you don't see any differences, manually trigger reindexing on publish.
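For step 4, comparing the two index definitions by eye is error-prone. If you can export each definition as JSON, a small script can list the differences. This is a rough Python sketch: the function names are hypothetical, the sample dictionaries are made-up illustrations rather than real index definitions, and how you export the JSON depends on your setup.

```python
def flatten(node, prefix=""):
    """Flatten a nested index definition into {property path: value}."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_index_definitions(author, publish):
    """Return {path: (author value, publish value)} for every differing property."""
    a, p = flatten(author), flatten(publish)
    return {
        path: (a.get(path), p.get(path))
        for path in sorted(a.keys() | p.keys())
        if a.get(path) != p.get(path)
    }


if __name__ == "__main__":
    # Hypothetical sample data, only to show the output shape.
    author = {"type": "lucene", "reindexCount": 2,
              "indexRules": {"dam:Asset": {"includePropertyTypes": ["String"]}}}
    publish = {"type": "lucene", "reindexCount": 1, "indexRules": {}}
    for path, (on_author, on_publish) in diff_index_definitions(author, publish).items():
        print(f"{path}: author={on_author!r} publish={on_publish!r}")
```

A property that exists on only one side shows up with None for the other, which is usually the first thing to look at.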

 

-Tarun

 

 

Administrator

@digarg17 Did you find the suggestions helpful? Please let us know if you require more information. Otherwise, please mark the answer as correct for posterity. If you've discovered a solution yourself, we would appreciate it if you could share it with the community. Thank you!



Kautuk Sahni

Level 6

Hi @digarg17 


Short answer: yes, it is supported in ASC. I did a POC for a client some months back.

However, you need to search for the complete word as it appears in the file. For instance, if "Abhishek" is in the file, it won't be returned unless I search for "Abhishek" exactly; not a single character can be missing or misplaced.
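The behaviour described here is what you would expect when the index matches whole tokens rather than substrings. The toy Python sketch below only illustrates that idea and is not how Oak or Lucene is implemented: the tokenizer is a simplification, and the lowercasing is an assumption about the analyzer.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a simplified stand-in for an analyzer)."""
    return set(re.findall(r"\w+", text.lower()))


def matches_whole_token(document_text, term):
    """True only if the search term equals one of the document's tokens."""
    return term.lower() in tokenize(document_text)


if __name__ == "__main__":
    text = "Report prepared by Abhishek"
    print(matches_whole_token(text, "Abhishek"))  # complete word
    print(matches_whole_token(text, "Abhi"))      # partial word
    print(matches_whole_token(text, "Abhishke"))  # misplaced characters
```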