I have a customer with up to 3,000,000 documents in a repository (mostly Microsoft Office formats, with some PDFs and images such as scanned files) which they plan to convert to PDF via PDFG (LC 8.2.1) and then store via Content Services.
They are also looking at scanning additional documents to add to this repository and want them converted to PDF, ideally using OCR to recognise the text. Given the size of the repository, they are looking for guidance on PDFG performance, along with best practices for using OCR given its single-threaded processing.
Looking around, I have found some information on this, but I was wondering if anyone has any best practices or advice they could share. The type of questions they would like answered include:
What are the major factors that influence throughput for the PDFG, OCR and document-stitching processes? For example:
impact of number of documents
impact of document size
impact of image quality on time
impact of number of pages per document
impact of one large multi-page document compared to many single-page documents.
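On the last point, a back-of-envelope model may help frame the question before any real benchmarking. The constants below are hypothetical placeholders, not measured PDFG figures; the point is only that any fixed per-document overhead (job setup, file open/close, writing the result back) makes many small documents cost more than one large document with the same total page count:

```python
# Illustrative single-threaded batch model. per_doc_overhead and per_page
# are made-up constants for the sketch -- measure real values on your system.

def batch_seconds(num_docs, pages_per_doc, per_doc_overhead=5.0, per_page=2.0):
    """Estimate wall-clock time for a single-threaded conversion queue."""
    return num_docs * (per_doc_overhead + pages_per_doc * per_page)

# Same 1,000 pages, packaged differently:
many_small = batch_seconds(1000, 1)   # 1,000 one-page documents -> 7000.0 s
one_large = batch_seconds(1, 1000)    # one 1,000-page document  -> 2005.0 s
```

With these (arbitrary) numbers, per-document overhead dominates the many-small-documents case, which is why the packaging of the repository can matter as much as total page count.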
Are there any guidelines for improving success of OCR process on PDF files or images stored in the repository e.g.:
Minimum DPI (seems to be 300 DPI, ideally 600 DPI)
Recommended image file format JPG/TIFF/IMG for scan
Maximum angle offset, if any, for OCR to be successful (i.e. is OCR expected to work if the document is skewed by x degrees?)
Quality of printed document being scanned
Anything else that would improve quality/success rate or impact it
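On the DPI point, one cheap pre-flight check is to derive the effective resolution from a scan's pixel dimensions and the paper size, and reject anything below the floor before it ever reaches OCR. A minimal sketch (the 300 DPI threshold is the figure mentioned above; the paper sizes are standard):

```python
# Pre-flight resolution check for scans destined for OCR.
# Paper sizes in inches; Letter and A4 are the common cases.
LETTER_INCHES = (8.5, 11.0)
A4_INCHES = (8.27, 11.69)

def effective_dpi(pixel_width, pixel_height, paper_inches=LETTER_INCHES):
    """Effective scan resolution, taking the worse of the two axes."""
    w_in, h_in = paper_inches
    return min(pixel_width / w_in, pixel_height / h_in)

def ocr_ready(pixel_width, pixel_height, min_dpi=300):
    """True if the scan meets the minimum resolution for OCR."""
    return effective_dpi(pixel_width, pixel_height) >= min_dpi

# A Letter page scanned at 300 DPI is 2550 x 3300 pixels:
print(ocr_ready(2550, 3300))  # True
print(ocr_ready(1275, 1650))  # False (only ~150 DPI)
```

Screening like this at scan time is much cheaper than discovering low-quality input after a failed OCR pass over a large batch.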
Did you look at the sizing PDFs on InField? They provide a lot of that info.
Another tip: PDFG in ES2 is supposed to support multi-threaded processing. Unfortunately, this feature does not work. There is a hotfix, which is part of SP1. It works, but the overhead of managing the different processes actually DECREASES performance when it's multi-threaded. Hopefully engineering can fix this soon. The end result is that PDFG is still really just a single-threaded app.