Adobe Experience Manager Sites & More

_SumitSinghal · 1/8/16

Hi Team,

I am unable to parse or read the text of a PDF file using the Tika parser.

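import java.io.InputStream;
import org.apache.sling.api.resource.Resource;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import com.day.cq.dam.api.Asset;
import com.day.cq.dam.commons.util.DamUtil;

// Resolve the DAM asset and its original rendition
Asset asset = DamUtil.resolveToAsset(dataResource);
Resource original = asset.getOriginal();
// Cap the extracted text at 10 * 1024 * 1024 characters
ContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Register under Parser.class so embedded documents are parsed recursively
context.set(Parser.class, parser);
// try-with-resources closes the stream even when parsing fails
try (InputStream is = original.adaptTo(InputStream.class)) {
    parser.parse(is, handler, metadata, context);
} catch (Exception e) {
    throw new Exception("Error parsing file " + asset.getPath(), e);
}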

I am getting a Tika parse exception.

Please help me resolve this issue, or share a link that I can go through.

Thanks a lot

smacdonald2008 · 1/8/16

Are you following online documentation to guide you on this use case, or is this a custom implementation?

Jitendra_S_Toma · 1/8/16

Would you mind sharing the exception details? Also, what kind of issue are you facing?

Jitendra

_SumitSinghal · 1/8/16

Hi,

Thank you for the quick reply.

The code above works fine in a standalone Java application to read the PDF content, but the same code does not work in AEM 6.0.

I get this Tika exception:

org.apache.tika.exception.TikaException: PDF parse error

on the line parser.parse(is, handler, metadata, context);

Thanks

kautuk_sahni · 1/10/16

Hi

Parsing large, broken, or malicious input can cause excessive memory or CPU use during indexing, and it may result in JVM crashes.

Link: https://helpx.adobe.com/experience-manager/kb/outOfProcessTextExtraction.html

I am not sure whether this is the problem in your case.

Please share the complete error log that you are encountering.

Thanks and Regards

Kautuk Sahni
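
One way to keep a single slow or runaway parse from blocking everything else is to run each extraction on a worker thread with a time limit. The following is only a minimal sketch using plain java.util.concurrent; the class name TimedExtraction and the method runWithTimeout are made up for illustration and are not part of AEM or Tika. It does not protect against an out-of-memory error inside the same JVM, which is what the out-of-process approach in the link is for.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not an AEM or Tika API): bounds how long one task may run.
final class TimedExtraction {

    private TimedExtraction() {
    }

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis.
     * Returns the fallback value if the task times out, throws, or the
     * calling thread is interrupted while waiting.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Requests interruption only; a task that ignores interrupts keeps running.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself failed (for example with a parse exception).
            return fallback;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The Callable passed in would wrap the parser.parse(...) call and return the extracted text; a null or empty fallback then marks the file as skipped so it can be logged and retried separately.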

_SumitSinghal · 1/11/16

Hi Kautuk,

Thank you for the response.

You are right: I get the Tika parsing exception only for large PDF files, roughly those larger than 1 MB.

Please advise how we can parse large files using Tika in AEM 6.0.

Thank you

Sumit

_SumitSinghal · 1/11/16

Hi,

Please find the full logs:

Caused by: org.apache.tika.exception.TikaException: PDF parse error
   at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:252)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at com.leggmason.gd.webservices.utils.SolrIndex.getFileContent(SolrIndex.java:1079)
   ... 12 common frames omitted
Caused by: com.adobe.internal.pdftoolkit.core.exceptions.PDFSecurityAuthorizationException: Security Manager for decryption is not set
   at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamEncryption(EncryptionImpl.java:223)
   at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamDecryptionHandler(EncryptionImpl.java:290)
   at com.adobe.internal.pdftoolkit.core.cos.CosEncryption.getStreamDecryptionStateHandler(CosEncryption.java:674)
   at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamForCopying(CosStream.java:422)
   at com.adobe.internal.pdftoolkit.core.cos.CosStream.copyStream(CosStream.java:367)
   at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStream(CosStream.java:468)
   at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamDecoded(CosStream.java:293)
   at com.adobe.internal.pdftoolkit.pdf.document.PDFContents.getContents(PDFContents.java:141)
   at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:94)
   at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:81)
   at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.<init>(ContentReader.java:54)
   at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.newInstance(ContentReader.java:83)
   at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:302)
   at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:193)
   at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.extractROTEWords(TextExtractor.java:348)
   at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.getROTEWordsIterator(TextExtractor.java:505)
   at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getReadingOrderedTextFromPDF(ReadingOrderTextExtractor.java:275)
   at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.extractParagraphs(ReadingOrderTextExtractor.java:566)
   at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getParagraphIterator(ReadingOrderTextExtractor.java:465)
   at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:194)

Thanks

Jörg_Hoh · 1/14/16

Looks like you're dealing with encrypted PDF documents. I am not a PDF expert, but please reach out to Daycare support and ask how to make it work.

Jörg
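
If some of the files do turn out to be encrypted, the batch code could at least detect that case and skip the file instead of failing the whole run. A rough sketch, assuming you match on the simple class name PDFSecurityAuthorizationException taken from the stack trace above so that your code does not need to compile against Adobe's internal packages. CauseInspector and hasCauseNamed are hypothetical names, not an existing API.

```java
// Hypothetical helper: walks an exception's cause chain looking for a class name.
final class CauseInspector {

    // Upper bound on the walk so a cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 50;

    private CauseInspector() {
    }

    /** Returns true if any throwable in the cause chain has the given simple class name. */
    static boolean hasCauseNamed(Throwable error, String simpleClassName) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < MAX_DEPTH; t = t.getCause(), depth++) {
            if (t.getClass().getSimpleName().equals(simpleClassName)) {
                return true;
            }
        }
        return false;
    }
}
```

In the catch block around parser.parse(...), a call such as CauseInspector.hasCauseNamed(e, "PDFSecurityAuthorizationException") could decide whether to log the asset path as "encrypted, skipped" or to rethrow.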

_SumitSinghal · 1/15/16

Hi,

I have around 7K documents, which I am parsing with the Tika parser in batches of 1K documents at a time. After the first 1K, the workflow process goes stale and never resumes parsing the remaining documents. It might be a JVM crash or an out-of-memory issue. Please advise how to handle this scenario.

PDF files are not encrypted.

Thank you

Jitendra_S_Toma · 1/15/16

@Sumit,

Try reducing the number of documents processed in one go. Also check the RAM, thread pool configuration, and so on in your instance. Some of these techniques are described here:

https://docs.adobe.com/docs/en/aem/6-1/deploy/configuring/performance.html

Jitendra
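
To illustrate the smaller-batches idea, here is a minimal sketch that splits the work into small batches and isolates failures per document, so one bad file does not stop the rest. BatchRunner, ItemHandler, partition, and processAll are hypothetical names for this example, and the right batch size is something you would have to tune on your instance. Note that catching Exception does not cover OutOfMemoryError, so memory settings still matter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: processes items in small batches and collects failures.
final class BatchRunner {

    /** Callback for one item; it may throw to signal that the item failed. */
    interface ItemHandler<T> {
        void handle(T item) throws Exception;
    }

    private BatchRunner() {
    }

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Handles every item batch by batch and returns the items whose handler threw. */
    static <T> List<T> processAll(List<T> items, int batchSize, ItemHandler<T> handler) {
        List<T> failed = new ArrayList<>();
        for (List<T> batch : partition(items, batchSize)) {
            for (T item : batch) {
                try {
                    handler.handle(item);
                } catch (Exception e) {
                    // Record the failure and continue with the next item.
                    failed.add(item);
                }
            }
        }
        return failed;
    }
}
```

Here the items would be asset paths and the handler would do the Tika parse for one path; the returned list gives you the documents to log or retry later.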