Apache Tika config in Lucene Index and Query Flow Summary | AEM Community Blog Seeding



Apache Tika config in Lucene Index and Query Flow Summary by MyAEMLearning Blog


This post covers the Apache Tika config for the Lucene full-text index, along with a summary of the query and indexing flow discussed in the past few posts.
Apache Tika is used to detect and extract text from a variety of file formats. It consists of a Detector and a Parser: the Detector identifies the file format, and the Parser parses the contents of the file.

In a Lucene index, Oak uses a default Tika config made up of the following:
1. TypeDetector - org.apache.tika.detect.TypeDetector
----->This detector uses the content type supplied in the input metadata to determine the MIME type.
2. DefaultParser - org.apache.tika.parser.DefaultParser
----->A composite parser that delegates to all available specific parser implementations.
----->E.g. PDFParser, MP4Parser, and the other parser implementations available in Apache Tika.
3. EmptyParser - org.apache.tika.parser.EmptyParser
----->As the name suggests, it is a dummy parser that does not parse anything.
----->Hence, defining MIME types within EmptyParser is equivalent to excluding them from text extraction.
----->In the default config, compressed assets and images are all excluded from extraction (their MIME types are defined within EmptyParser).
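
The three pieces above map onto a Tika config file roughly as follows. This is a minimal sketch rather than the exact file shipped with Oak: the MIME types listed under EmptyParser are illustrative assumptions standing in for "compressed assets and images", and the real default list is likely longer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <!-- Trusts the content type hint passed in the input metadata -->
    <detector class="org.apache.tika.detect.TypeDetector"/>
  </detectors>
  <parsers>
    <!-- Composite parser: delegates to PDFParser, MP4Parser, and so on -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <!-- MIME types listed here are skipped during text extraction -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <!-- Illustrative entries only (assumed, not the full default list) -->
      <mime>application/zip</mime>
      <mime>application/gzip</mime>
      <mime>image/jpeg</mime>
      <mime>image/png</mime>
    </parser>
  </parsers>
</properties>
```

Under this layout, adding a `<mime>` entry to the EmptyParser block would exclude that type from extraction, while removing one would hand it back to DefaultParser.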

Read Full Blog

Please use this thread to ask any related questions.

