Apache Tika config in Lucene Index and Query Flow Summary | AEM Community Blog Seeding

Apache Tika config in Lucene Index and Query Flow Summary by MyAEMLearning Blog

Abstract

This post covers the Apache Tika configuration used by the Lucene full-text index and summarizes the query and indexing flow discussed in the past few posts.
Apache Tika is used to detect and extract text from a variety of file formats. It consists of a Detector and a Parser: the Detector identifies the file format, and the Parser parses the contents of the file.

In the Lucene index, Oak uses a default config that relies on:
1. TypeDetector - org.apache.tika.detect.TypeDetector
----->This detector uses the content type available in the input metadata to arrive at the content type/mimeType.
2. DefaultParser - org.apache.tika.parser.DefaultParser
----->A composite parser based on all available specific parser implementations.
----->E.g. PDFParser, MP4Parser, and all other parser implementations available in Apache Tika.
3. EmptyParser - org.apache.tika.parser.EmptyParser
----->As the name suggests, it is a dummy parser that does not parse anything.
----->Hence, defining mime types within EmptyParser is equivalent to excluding them from text extraction.
----->In the default config, compressed assets and images are all excluded from extraction (their mimeTypes are defined within EmptyParser).
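
The routing described above can be pictured with a small toy model. This is only a sketch in plain Python and does not use the Apache Tika API: the function names, the `EXCLUDED_MIME_TYPES` set, and the three mime types in it are hypothetical and chosen for illustration, and the stand-in for DefaultParser simply decodes bytes as UTF-8 instead of delegating to format-specific parsers.

```python
# Toy model of the detector/parser routing; NOT the Apache Tika API.
# All names and the mime types listed here are illustrative assumptions.
EXCLUDED_MIME_TYPES = {"application/zip", "image/png", "image/jpeg"}


def detect_type(metadata):
    """Mimics TypeDetector: trust the content type declared in the metadata."""
    return metadata.get("Content-Type", "application/octet-stream")


def default_parser(data):
    """Stand-in for DefaultParser: pretend every format is UTF-8 text."""
    return data.decode("utf-8", errors="replace")


def empty_parser(data):
    """Stand-in for EmptyParser: extracts nothing."""
    return ""


def extract_text(data, metadata):
    """Pick a parser from the declared mime type, then extract text."""
    mime_type = detect_type(metadata)
    if mime_type in EXCLUDED_MIME_TYPES:
        parser = empty_parser
    else:
        parser = default_parser
    return parser(data)
```

In this model, listing a mime type under the empty parser means the binary is never read for text, which is why it behaves like an exclusion from full-text extraction.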

Read Full Blog

Apache Tika config in Lucene Index and Query Flow Summary

Q&A

Please use this thread to ask related questions.

Kautuk Sahni

