Hi Team,
I am unable to parse or read the text of a PDF file using the Tika parser.
Asset asset = DamUtil.resolveToAsset(dataResource);
Resource original = asset.getOriginal();
// Write limit: the handler collects at most 10 * 1024 * 1024 characters
ContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Register under Parser.class so that embedded documents are parsed recursively
context.set(Parser.class, parser);
// try-with-resources closes the stream even when parsing fails
try (InputStream is = original.adaptTo(InputStream.class)) {
    parser.parse(is, handler, metadata, context);
} catch (Exception e) {
    throw new Exception("Error parsing file " + asset.getPath(), e);
}
I am getting a Tika parse exception.
Please help me resolve this issue, or share a link that I can go through.
Thanks a lot
Hi
Parsing large, broken, or malicious input can cause excessive memory or CPU use during indexing, and it may even crash the JVM.
Link: https://helpx.adobe.com/experience-manager/kb/outOfProcessTextExtraction.html
I am not sure whether this is the problem in your case.
Please share the complete error log that you are seeing.
Thanks and Regards
Kautuk Sahni
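One way to keep a single slow or runaway parse from blocking everything else is to run it on a worker thread with a time limit. The following is a minimal sketch under stated assumptions: it uses only java.util.concurrent, the names BoundedTask and runWithTimeout are hypothetical, and the Callable stands in for the actual parser.parse(...) call. A timeout does not reclaim memory, and cancellation only works if the running code responds to interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or AEM.
final class BoundedTask {
    private BoundedTask() {}

    /**
     * Runs the task on a separate thread and returns its result, or the
     * fallback value if the task does not finish within the timeout.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit, T fallback)
            throws ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Request interruption of the worker; code that ignores interrupts keeps running.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap the parse in the Callable and treat the fallback value as "skipped, inspect this file later".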
Are you following online documentation for this use case, or is this a custom implementation?
Hi,
Thank you for the quick reply.
The above code works fine in a standalone Java application to read the PDF content, but the same code does not work in AEM 6.0.
I got the following Tika exception:
org.apache.tika.exception.TikaException: PDF parse error
on the line parser.parse(is, handler, metadata, context);
Thanks
Hi Kautuk,
Thank you for the response.
You are right: I get the Tika parsing exception only for large PDF files, roughly those larger than 1 MB.
Please help me understand how to parse large files using Tika in AEM 6.0.
Thank you
Sumit
Hi,
Please find the full logs:
Caused by: org.apache.tika.exception.TikaException: PDF parse error
at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:252)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.leggmason.gd.webservices.utils.SolrIndex.getFileContent(SolrIndex.java:1079)
... 12 common frames omitted
Caused by: com.adobe.internal.pdftoolkit.core.exceptions.PDFSecurityAuthorizationException: Security Manager for decryption is not set
at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamEncryption(EncryptionImpl.java:223)
at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamDecryptionHandler(EncryptionImpl.java:290)
at com.adobe.internal.pdftoolkit.core.cos.CosEncryption.getStreamDecryptionStateHandler(CosEncryption.java:674)
at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamForCopying(CosStream.java:422)
at com.adobe.internal.pdftoolkit.core.cos.CosStream.copyStream(CosStream.java:367)
at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStream(CosStream.java:468)
at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamDecoded(CosStream.java:293)
at com.adobe.internal.pdftoolkit.pdf.document.PDFContents.getContents(PDFContents.java:141)
at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:94)
at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:81)
at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.<init>(ContentReader.java:54)
at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.newInstance(ContentReader.java:83)
at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:302)
at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:193)
at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.extractROTEWords(TextExtractor.java:348)
at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.getROTEWordsIterator(TextExtractor.java:505)
at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getReadingOrderedTextFromPDF(ReadingOrderTextExtractor.java:275)
at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.extractParagraphs(ReadingOrderTextExtractor.java:566)
at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getParagraphIterator(ReadingOrderTextExtractor.java:465)
at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:194)
Thanks
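In the trace above, the outer "PDF parse error" wraps a more specific cause ("Security Manager for decryption is not set"). When logging such failures, it can help to surface the innermost cause directly. The following is a minimal sketch using only the standard library; the class and method names (CauseChain, rootCause, describe) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting nested exceptions.
final class CauseChain {
    private static final int MAX_DEPTH = 100;

    private CauseChain() {}

    /** Returns the innermost cause of t (or t itself if it has no cause). */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop on a self-referencing cause and cap the depth to avoid looping on cyclic chains.
        for (int depth = 0; depth < MAX_DEPTH; depth++) {
            Throwable cause = current.getCause();
            if (cause == null || cause == current) {
                break;
            }
            current = cause;
        }
        return current;
    }

    /** Lists each link of the chain as "ClassName: message", outermost first. */
    static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<>();
        Throwable current = t;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            lines.add(current.getClass().getName() + ": " + current.getMessage());
            Throwable cause = current.getCause();
            current = (cause == current) ? null : cause;
        }
        return lines;
    }
}
```

Logging rootCause(e) in the catch block around parser.parse(...) would show the decryption message instead of only the generic parse error.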
It looks like you're dealing with encrypted PDF documents. I am not a PDF expert, but please reach out to Daycare support and ask how to make it work.
Jörg
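A rough way to check whether a file is flagged as encrypted before handing it to the parser is to look for the /Encrypt name, which encrypted PDFs carry in their trailer dictionary. The following is only a heuristic sketch with hypothetical names (PdfEncryptionHint, mentionsEncrypt): it scans the raw bytes, so it can report a false positive if the same text happens to appear elsewhere in the file, and it does not attempt to decrypt anything.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic check, not part of Tika or AEM.
final class PdfEncryptionHint {
    private static final byte[] MARKER = "/Encrypt".getBytes(StandardCharsets.US_ASCII);

    private PdfEncryptionHint() {}

    /**
     * Returns true if the stream contains the ASCII text "/Encrypt" anywhere.
     * Wrap file streams in a BufferedInputStream, since this reads byte by byte.
     */
    static boolean mentionsEncrypt(InputStream in) throws IOException {
        int matched = 0; // number of MARKER bytes matched so far
        int b;
        while ((b = in.read()) != -1) {
            if (b == MARKER[matched]) {
                matched++;
                if (matched == MARKER.length) {
                    return true;
                }
            } else {
                // '/' occurs only at the start of MARKER, so a mismatch can
                // restart the match only if the current byte is itself '/'.
                matched = (b == MARKER[0]) ? 1 : 0;
            }
        }
        return false;
    }
}
```

Files for which this returns true are worth setting aside and opening in a PDF viewer to confirm their security settings.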
Hi,
I have around 7K documents, which I am parsing with the Tika parser in batches of 1K documents at a time. After the first 1K, the workflow process goes stale and never resumes parsing the remaining documents. The JVM might be crashing or running out of memory. Please help me handle this scenario.
The PDF files are not encrypted.
Thank you
@Sumit,
Try reducing the number of documents processed in one go. Also check the RAM and thread pool configuration:
https://docs.adobe.com/docs/en/aem/6-1/deploy/configuring/performance.html
Jitendra
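The suggestion to process fewer documents at a time can be sketched as follows. This is a minimal illustration using only the standard library, and all names (Batches, partition, processAll, DocumentTask) are hypothetical. The task passed in stands in for the per-document parsing code; a failure in one document is recorded rather than allowed to abort the whole run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for processing documents in small batches.
final class Batches {
    private Batches() {}

    /** Stand-in for the per-document work, for example a Tika parse. */
    interface DocumentTask {
        void process(String path) throws Exception;
    }

    /** Splits items into consecutive lists of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Processes every path batch by batch and returns the paths that failed. */
    static List<String> processAll(List<String> paths, int batchSize, DocumentTask task) {
        List<String> failed = new ArrayList<>();
        for (List<String> batch : partition(paths, batchSize)) {
            for (String path : batch) {
                try {
                    task.process(path);
                } catch (Exception e) {
                    // Record the failure and continue with the next document.
                    failed.add(path);
                }
            }
            // End of a batch: a natural point to save progress or release resources.
        }
        return failed;
    }
}
```

The returned list of failed paths can be logged and retried separately, which also makes it easier to spot which files trigger the parse error.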