SOLVED

Unable to parse pdf document using Tika Parser in AEM6.0

Level 4

Hi Team,

I am unable to parse or read the text of a PDF file using the Tika parser.

 

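import java.io.InputStream;
import org.apache.sling.api.resource.Resource;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import com.day.cq.dam.api.Asset;
import com.day.cq.dam.commons.util.DamUtil;

Asset asset = DamUtil.resolveToAsset(dataResource);
Resource original = asset.getOriginal();
// Cap the extracted text at 10 * 1024 * 1024 characters
ContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Register under Parser.class, the key Tika looks up when parsing embedded documents
context.set(Parser.class, parser);
// try-with-resources closes the stream even when parsing fails
try (InputStream is = original.adaptTo(InputStream.class)) {
    parser.parse(is, handler, metadata, context);
} catch (Exception e) {
    throw new Exception("Error parsing file " + asset.getPath(), e);
}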

I am getting a Tika parse exception.

Please help me resolve this issue or share a link that I can go through.

Thanks a lot

1 Accepted Solution

Correct answer by
Administrator

Hi,

Parsing large, broken, or malicious input can cause excessive memory or CPU use during indexing, and it may result in JVM crashes.

Link: https://helpx.adobe.com/experience-manager/kb/outOfProcessTextExtraction.html

I am not sure whether this is the problem in your case.

Please share the complete error log you are encountering.

 

Thanks and Regards

Kautuk Sahni
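
As a rough illustration of limiting the impact of a runaway parse, here is a minimal sketch that runs a task on a worker thread and gives up after a timeout. This is not the out-of-process extraction described in the linked article. The names BoundedTask and runWithTimeout are hypothetical, and the task stands in for whatever wraps the parser.parse(...) call. A timeout only bounds how long the caller waits: it does not cap memory use, and the cancellation only takes effect if the running code responds to thread interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or AEM.
final class BoundedTask {

    /** Runs the task on a worker thread; returns fallback if it does not finish in time. */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(task);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupts the worker; only helps if the task checks for interruption.
                future.cancel(true);
                return fallback;
            } catch (ExecutionException e) {
                // The task itself threw; surface its cause to the caller.
                throw new IllegalStateException("task failed", e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller could pass a Callable that performs the parse and returns handler.toString(), and log the asset path whenever the fallback value comes back.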



10 Replies

Level 10

Are you following online documentation to guide you on this use case, or is this a custom implementation?

Level 9
Would you mind sharing the exception details? Also, what kind of issue are you facing? ----- Jitendra

Level 4

Hi,

Thank you for the quick reply.

The above code works fine in a standalone Java application to read the PDF content, but the same code does not work in AEM 6.0.

I got the Tika exception:

org.apache.tika.exception.TikaException: PDF parse error

On line     parser.parse(is, handler, metadata, context);

Thanks

Level 4

Hi Kautuk,

Thank you for the response.

You are right, I get the Tika parsing exception only for large PDF files, which may be larger than 1 MB.

Please help me understand how we can parse large files using Tika in AEM 6.0.

Thank you

Sumit

Level 4

Hi,

 

Please find the full logs:

Caused by: org.apache.tika.exception.TikaException: PDF parse error
    at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:252)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.leggmason.gd.webservices.utils.SolrIndex.getFileContent(SolrIndex.java:1079)
    ... 12 common frames omitted
Caused by: com.adobe.internal.pdftoolkit.core.exceptions.PDFSecurityAuthorizationException: Security Manager for decryption is not set
    at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamEncryption(EncryptionImpl.java:223)
    at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamDecryptionHandler(EncryptionImpl.java:290)
    at com.adobe.internal.pdftoolkit.core.cos.CosEncryption.getStreamDecryptionStateHandler(CosEncryption.java:674)
    at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamForCopying(CosStream.java:422)
    at com.adobe.internal.pdftoolkit.core.cos.CosStream.copyStream(CosStream.java:367)
    at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStream(CosStream.java:468)
    at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamDecoded(CosStream.java:293)
    at com.adobe.internal.pdftoolkit.pdf.document.PDFContents.getContents(PDFContents.java:141)
    at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:94)
    at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:81)
    at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.<init>(ContentReader.java:54)
    at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.newInstance(ContentReader.java:83)
    at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:302)
    at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:193)
    at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.extractROTEWords(TextExtractor.java:348)
    at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.getROTEWordsIterator(TextExtractor.java:505)
    at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getReadingOrderedTextFromPDF(ReadingOrderTextExtractor.java:275)
    at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.extractParagraphs(ReadingOrderTextExtractor.java:566)
    at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getParagraphIterator(ReadingOrderTextExtractor.java:465)
    at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:194)

 

Thanks

Employee Advisor

It looks like you're dealing with encrypted PDF documents. I am not a PDF expert, but please reach out to Daycare support and ask how to make it work.

Jörg
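
To tell this case apart from other parse failures in code, one option is to walk the exception's cause chain and look for the class named in the log. The sketch below is only an illustration: CauseChain and hasCause are hypothetical names, and the match is by class name so the code does not need to compile against Adobe's internal PDF toolkit classes.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical helper, not part of Tika or AEM.
final class CauseChain {

    /** Returns true if t or any of its causes has the given fully qualified class name. */
    static boolean hasCause(Throwable t, String className) {
        // Identity-based set guards against a cyclic cause chain.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
            if (cur.getClass().getName().equals(className)) {
                return true;
            }
        }
        return false;
    }
}
```

In the catch block around parser.parse(...), a call such as CauseChain.hasCause(e, "com.adobe.internal.pdftoolkit.core.exceptions.PDFSecurityAuthorizationException") could be used to log the asset as protected and skip it rather than failing the whole run.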

Level 4

Hi,

I have around 7K documents, which I am parsing with the Tika parser in batches of 1K documents at a time. After the first 1K, the workflow process goes into a stale state and never comes back to parse the remaining documents. It might be a JVM crash or an out-of-memory issue. Please help me handle this scenario.

The PDF files are not encrypted.

Thank you

Level 9

@Sumit,

Try reducing the number of documents processed in one go. Also check the RAM, thread pool configuration, etc. on your instance. Some of the techniques can be found here:

https://docs.adobe.com/docs/en/aem/6-1/deploy/configuring/performance.html

 

Jitendra
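
A minimal sketch of the smaller-batch idea, assuming the per-document work can be expressed as a callback: the list is split into fixed-size batches, and a failure on one document is recorded instead of stopping the run. BatchRunner, partition, and processAll are hypothetical names. Only RuntimeException is caught here, so this does not protect against OutOfMemoryError or a JVM crash; it only keeps one bad file from blocking the rest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Tika or AEM.
final class BatchRunner {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /** Runs the action on every item, batch by batch; returns the items whose action threw. */
    static <T> List<T> processAll(List<T> items, int batchSize, Consumer<T> action) {
        List<T> failed = new ArrayList<>();
        for (List<T> batch : partition(items, batchSize)) {
            for (T item : batch) {
                try {
                    action.accept(item);
                } catch (RuntimeException e) {
                    // Record the failure and continue with the next document.
                    failed.add(item);
                }
            }
        }
        return failed;
    }
}
```

With 7K asset paths and a batch size of, say, 100, the returned list gives the paths to retry or inspect separately once the run finishes.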