Adobe Experience Manager Sites & More

kautuk_sahni · 9/26/21

Text Pre-Extraction in AEM : Comprehensive Guide by Aemcq5tutorials

Abstract

Text pre-extraction in AEM is highly recommended when indexing or re-indexing Lucene indexes on repositories with large binaries that contain extractable text (e.g. PDFs, Word documents, PowerPoint files, TXT files). Re-indexing Lucene indexes directly is very expensive and may cause performance issues. After completing this tutorial you will understand:

Text pre-extraction overview.
When to use text pre-extraction in AEM.
When not to use text pre-extraction in AEM.
Prerequisites for using text pre-extraction.
Execute text pre-extraction.
Validate Oak index consistency.

Text Pre-extraction Overview

Text pre-extraction is the process of extracting and processing text from binaries that contain extractable text (e.g. PDFs, Word documents, PowerPoint files, TXT files). Extracting text from binaries is an expensive operation and slows down the indexing rate considerably. Lucene indexing runs in single-threaded mode, so this cost is paid one binary at a time.

For incremental indexing this mostly works fine, but when re-indexing or creating an index for the first time after a migration, it increases the indexing time considerably. To speed up such cases, Oak supports pre-extracting text from binaries so that no text has to be extracted at indexing time. This feature consists of two main steps:

Extract and store the text from binaries using oak-run tooling.
Configure the Oak runtime to use the extracted text at indexing time via PreExtractedTextProvider.
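
As a rough illustration of the first step, the sketch below assembles the two oak-run invocations that are usually involved: one that lists the repository's binaries into a CSV file, and one that extracts their text into a store directory. The subcommand and flag names (tika, generate, extract, --nodestore, --fds-path, --data-file, --store-path) follow the Apache Jackrabbit Oak documentation as recalled here and should be checked against the oak-run version that matches your AEM instance. All paths, jar names and function names are hypothetical placeholders. The second step is a runtime configuration and is not shown.

```python
import shlex


def tika_generate_cmd(oak_run_jar, nodestore, datastore, csv_file):
    """Build the command that lists repository binaries into a CSV file.

    Flag names are assumed from the Oak documentation; verify them against
    your oak-run version before running anything.
    """
    return [
        "java", "-jar", oak_run_jar, "tika",
        "--fds-path", datastore,
        "--nodestore", nodestore,
        "--data-file", csv_file,
        "generate",
    ]


def tika_extract_cmd(oak_run_jar, tika_jar, datastore, csv_file, store_path):
    """Build the command that extracts text for the binaries listed in the CSV."""
    # ':' is the Unix classpath separator; use ';' on Windows.
    classpath = f"{oak_run_jar}:{tika_jar}"
    return [
        "java", "-cp", classpath, "org.apache.jackrabbit.oak.run.Main", "tika",
        "--data-file", csv_file,
        "--store-path", store_path,
        "--fds-path", datastore,
        "extract",
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    generate = tika_generate_cmd(
        "oak-run.jar", "/aem/crx-quickstart/repository/segmentstore",
        "/aem/crx-quickstart/repository/datastore", "oak-binary-stats.csv")
    extract = tika_extract_cmd(
        "oak-run.jar", "tika-app.jar",
        "/aem/crx-quickstart/repository/datastore",
        "oak-binary-stats.csv", "./store")
    # Print the commands for review instead of executing them.
    print(shlex.join(generate))
    print(shlex.join(extract))
```

The script only prints the commands so they can be reviewed first; extraction itself is best run against a copy of the repository or during a maintenance window.
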
Oak text pre-extraction is recommended when indexing or re-indexing Lucene indexes on repositories with large volumes of files (binaries) that contain extractable text (e.g. PDFs, Word documents, PowerPoint files, TXT files) and that qualify for full-text search via deployed Oak indexes, for example /oak:index/damAssetLucene.
Text pre-extraction only benefits the indexing and re-indexing of Lucene indexes, NOT Oak property indexes, since property indexes do not extract text from binaries.
Text pre-extraction has a high positive impact on the full-text re-indexing of text-heavy binaries (PDF, Doc, TXT, etc.), whereas a repository of images will not see the same gains, since images do not contain extractable text.
Text pre-extraction extracts the text needed for full-text search once, ahead of time, and exposes it to the Oak indexing process in a form that is cheap to consume.
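
The idea behind consuming pre-extracted text can be sketched as a lookup with a fallback: the indexer first asks a store for text saved earlier, and only parses the binary itself when nothing is found. This is a toy model, not Oak's actual PreExtractedTextProvider API. The class name, function name and blob ids below are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


class PreExtractedStore:
    """Toy stand-in for a store of previously extracted text, keyed by blob id."""

    def __init__(self, texts: Dict[str, str]) -> None:
        self._texts = texts

    def get_text(self, blob_id: str) -> Optional[str]:
        # None means "nothing was pre-extracted for this blob".
        return self._texts.get(blob_id)


def text_for_indexing(
    blob_id: str,
    store: PreExtractedStore,
    extract: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, source), preferring pre-extracted text over fresh parsing."""
    text = store.get_text(blob_id)
    if text is not None:
        return text, "pre-extracted"
    # Fallback: the expensive path that pre-extraction is meant to avoid.
    return extract(blob_id), "extracted-at-index-time"


if __name__ == "__main__":
    def slow_extract(blob_id: str) -> str:
        # Placeholder for an expensive parser run on the binary.
        return f"text parsed from {blob_id}"

    store = PreExtractedStore({"blob-1": "annual report 2021"})
    for blob_id in ("blob-1", "blob-2"):
        print(blob_id, text_for_indexing(blob_id, store, slow_extract))
```

In this model the expensive extractor runs only for blobs missing from the store, which is why the gain is largest for repositories dominated by text-heavy binaries.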

Read Full Blog

Text Pre-Extraction in AEM : Comprehensive Guide

Q&A

Please use this thread to ask related questions.
