

Search content of pdf files using Apache Tika on AEM 6.3


Level 1

Hi,

Recently I've been working on search functionality in AEM 6.3. Searching by page content or asset metadata works fine, but I haven't been able to configure searching by the content of PDF files. From what I've read, indexing of PDF content should work out of the box; most of the articles are actually about disabling it. I would be very grateful for a snippet or any other example of an index configuration that enables PDF content indexing.

Thanks in advance!

4 Replies


Level 10

You can also use other Java APIs that are meant for searching PDF content. This would require a custom service. See, for example, PDF Text Search And PDF Text Extraction Using PDFOne (for Java).


Level 1

Thank you for your reply, but I'd rather try to achieve indexing without any custom services. If that fails, I'll just create my own using the Tika API. If anyone here has PDF indexing working out of the box and could send me their oak:index config, I would be much obliged.
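For reference, a minimal sketch of what a full-text Oak Lucene index definition that covers binary content might look like, written in the node/property notation used by the Oak documentation. The index name pdfFulltext is hypothetical, and the aggregate and rule shown are assumptions for illustration, not the configuration that ships with AEM 6.3. DAM assets store the original binary deeper in the tree (under the asset's renditions), so a real definition for dam:Asset would likely need further aggregate includes.

```text
/oak:index/pdfFulltext                      (hypothetical index name)
  - jcr:primaryType = "oak:QueryIndexDefinition"
  - type = "lucene"
  - async = "async"
  - compatVersion = 2
  - reindex = true
  + aggregates
    + nt:file
      + include0
        - path = "jcr:content"              (pull the binary's text into the file node)
  + indexRules
    + nt:base
      - includePropertyTypes = ["String", "Binary"]
      + properties
        + allProps
          - name = "^[^\/]*$"
          - isRegexp = true
          - nodeScopeIndex = true           (make values searchable via full-text queries)
```

Including Binary in includePropertyTypes is what allows jcr:data to be passed to Tika for text extraction. If extraction itself fails, the definition alone will not help.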


Level 1

Update:
After extracting the indexed data with Luke, I noticed that instead of the extracted text, the :fulltext field has the value TextExtractionError. I then reindexed the data using oak-run.jar with Tika added to the classpath:

java -cp oak-run-1.7.4.jar;tika-app-1.17.jar org.apache.jackrabbit.oak.run.Main index --reindex --index-paths=/oak:index/lucene --read-write --fds-path="path-to-aem\crx-quickstart\repository\datastore" "path-to-aem\crx-quickstart\repository\segmentstore"

This time the text was extracted successfully.

The question is: why is the text not extracted when using the default AEM Oak Index Manager? I'm using a clean installation of AEM 6.3 with the newest service pack.


Level 1

I need to implement the same thing. However, it looks like AEM 6.3 OOB search already indexes the content of document assets too: I can find assets by searching for a word that appears in their content (tried with Excel, PowerPoint, Word and PDF). Are there any specific cases in which the OOB search fails? What additional advantages would AEM integrated with Solr and Apache Tika offer? Is it better in terms of performance? Any help is greatly appreciated. Thanks!