Adobe Experience Manager Sites & More

ncmika · 4/6/18

Hi,

Recently I've been working on search functionality in AEM 6.3. Searching by page content or asset metadata works fine, but I can't get searching by the content of PDF files to work. From what I've read, indexing of PDF content should work out of the box; most of the articles are actually about disabling it. I would be very grateful for a snippet or any other example of an index configuration that enables PDF content indexing.

Thanks in advance!

smacdonald2008 · 4/6/18

You can also use other Java APIs that are meant for searching PDF content. This would require a custom service. See, for example, PDF Text Search And PDF Text Extraction Using PDFOne (for Java).

ncmika · 4/9/18

Thank you for your reply, but I'd rather try to get indexing working without any custom services. If that fails, I'll just write my own using the Tika API. If anyone here has PDF indexing working out of the box, I would be much obliged if you could send me your oak:index config.

ncmika · 4/11/18

Update:
After inspecting the indexed data with Luke, I noticed that the :fulltext field contains the value TextExtractionError instead of the extracted text. I then reindexed the data using oak-run.jar with Tika added to the classpath:

java -cp oak-run-1.7.4.jar;tika-app-1.17.jar org.apache.jackrabbit.oak.run.Main index --reindex --index-paths=/oak:index/lucene --read-write --fds-path="path-to-aem\crx-quickstart\repository\datastore" "path-to-aem\crx-quickstart\repository\segmentstore"

Text has been extracted successfully.
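
One note for anyone repeating this on Linux or macOS: the `;` between the two jars in the command above is the Windows classpath separator, and on Unix-like systems it is `:`. Below is a minimal Java sketch that assembles the same argument list using the platform's own separator. The class and method names (`OakRunCommand`, `build`) are hypothetical helpers made up for illustration, not part of oak-run or AEM, and the paths are the same placeholders as in the command above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of oak-run or AEM.
class OakRunCommand {

    // Builds the oak-run reindex command shown above as an argument list.
    // File.pathSeparator is ';' on Windows and ':' on Unix-like systems.
    static List<String> build(List<String> jars, String indexPath,
                              String datastore, String segmentstore) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(String.join(File.pathSeparator, jars));
        cmd.add("org.apache.jackrabbit.oak.run.Main");
        cmd.add("index");
        cmd.add("--reindex");
        cmd.add("--index-paths=" + indexPath);
        cmd.add("--read-write");
        cmd.add("--fds-path=" + datastore);
        cmd.add(segmentstore);
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder path, as in the original command.
        String repo = "path-to-aem/crx-quickstart/repository";
        List<String> cmd = build(
                Arrays.asList("oak-run-1.7.4.jar", "tika-app-1.17.jar"),
                "/oak:index/lucene",
                repo + "/datastore",
                repo + "/segmentstore");
        // Print the command for inspection; it is not executed here.
        System.out.println(String.join(" ", cmd));
    }
}
```

The list can be handed to `new ProcessBuilder(cmd)` to actually launch the reindex, which also avoids shell quoting issues when the repository path contains spaces.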

The question is: why is the text not extracted by the default AEM Oak index manager? I'm using a clean installation of AEM 6.3 with the newest service pack.

preetim85609332 · 9/12/18

I need to implement the same thing. However, it looks like the AEM 6.3 OOB search already indexes the content of text-based assets too: I can find assets by searching for a word that appears in their content (tried with Excel, PowerPoint, Word and PDF). Are there any specific cases in which the OOB search fails? What additional advantages would AEM with Solr, integrated with Apache Tika, offer? Is it better in terms of performance? Any help is greatly appreciated. Thanks!

Search content of pdf files using Apache Tika on AEM 6.3
