This post covers the Apache Tika config used by the Lucene full-text index, along with a summary of the query and indexing topics that we discussed in the past few posts.
Apache Tika is used to detect and extract text from a variety of file formats. It consists of a Detector and a Parser: the Detector identifies the file format, and the Parser parses the contents of the file.
For a Lucene index, Oak uses a default Tika config made up of the following:
1. TypeDetector - org.apache.tika.detect.TypeDetector
----->This detector relies on the content type available in the input metadata to determine the MIME type of the file.
2. DefaultParser - org.apache.tika.parser.DefaultParser
----->A composite parser that delegates to all available format-specific parser implementations.
----->E.g. PDFParser, MP4Parser, and every other parser implementation available in Apache Tika.
3. EmptyParser - org.apache.tika.parser.EmptyParser
----->As the name suggests, it is a dummy parser that does not parse anything.
----->Hence, defining MIME types within EmptyParser is equivalent to excluding them from text extraction.
----->In the default config, compressed assets and images are all excluded from extraction (the related MIME types are defined within EmptyParser).
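
To make the three pieces above concrete, here is a rough sketch of how such a config could be laid out in Tika's XML config format. The class names are the ones listed above; the MIME types under EmptyParser are illustrative picks for "compressed assets and images" and are an assumption, not a copy of Oak's actual default list, which may be longer or differ.

```xml
<properties>
  <detectors>
    <!-- Determines the MIME type from the content type in the input metadata -->
    <detector class="org.apache.tika.detect.TypeDetector"/>
  </detectors>
  <parsers>
    <!-- Composite parser covering the parser implementations available in Tika -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <!-- MIME types listed here go to the dummy parser, so no text is extracted -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <!-- Illustrative entries only (assumed, not the verified default list) -->
      <mime>application/zip</mime>
      <mime>application/x-tar</mime>
      <mime>image/png</mime>
      <mime>image/jpeg</mime>
    </parser>
  </parsers>
</properties>
```

Following the same pattern, excluding one more format from text extraction should only need one more mime entry under EmptyParser.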