OCR performance & best practices with PDFG LiveCycle ES

amarkill

13-04-2010

I have a customer with up to 3,000,000 documents in a repository (mainly MS formats, with some PDFs and images such as scanned files), which they plan to convert to PDF via PDFG (LC 8.2.1) and then store via Content Services.

They are also looking at scanning additional documents to add to this repository and want them converted to PDF, ideally using OCR to identify text. Given the size of the repository, they are looking for guidance on PDFG performance, along with best practices for using OCR given the single-thread requirement.

Looking around, I have found some information on this, but was wondering if anyone has any best practices or advice they could share. The types of questions they would like answered include:

  • What are the major factors that influence the throughput of the PDFG, OCR and document stitching processes? For example:

      • impact of number of documents

      • impact of document size

      • impact of image quality on processing time

      • impact of number of pages per document

      • impact of one large multi-page document compared to many single-page documents

  • Are there any guidelines for improving the success of the OCR process on PDF files or images stored in the repository? For example:

      • Minimum DPI (seems to be 300 DPI, ideally 600 DPI)

      • Recommended DPI

      • Recommended image file format (JPG/TIFF/IMG) for scans

      • Maximum angle offset, if any, for OCR to be successful (i.e. is OCR expected to work if the document is skewed by x degrees?)

      • Quality of the printed document being scanned

      • Anything else that would improve or impact the quality/success rate

Thanks in advance,

Alastair

Replies

John_C_Cummins

21-04-2010

Did you look at the sizing PDFs on InField? They provide a lot of that info.

Another tip: PDFG in ES2 is supposed to support multi-threaded processing. Unfortunately, this feature does not work. There is a hotfix, which is part of SP1. It works, but the overhead of managing the different processes actually DECREASES performance when it's multi-threaded. Hopefully engineering can fix this soon. The end result is that PDFG is still really just a single-threaded app.
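
To get a feel for what the single-threaded limit means for a backlog of 3,000,000 documents, a back-of-envelope estimate may help. The sketch below is plain Python and not part of LiveCycle. The function name `estimate_conversion_days` is made up for illustration, and the per-document times are placeholders, not measurements. Time a representative sample of your own files (with and without OCR) and substitute those figures.

```python
SECONDS_PER_DAY = 86_400


def estimate_conversion_days(num_documents, seconds_per_document):
    """Estimate wall-clock days for one strictly sequential converter.

    Assumes documents are processed one at a time (single-threaded)
    and that the average per-document time stays constant.
    """
    if num_documents < 0 or seconds_per_document < 0:
        raise ValueError("inputs must be non-negative")
    total_seconds = num_documents * seconds_per_document
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    backlog = 3_000_000  # repository size from the original question
    # Hypothetical average conversion times in seconds; replace with
    # values measured on a sample of the customer's documents.
    for seconds in (1, 5, 10):
        days = estimate_conversion_days(backlog, seconds)
        print(f"{seconds:>2} s/doc -> {days:.1f} days")
```

Even a small change in the average time per document moves the total by weeks at this volume, so it is probably worth measuring OCR and non-OCR conversions separately before committing to a schedule.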