Community Manager

Adobe PDF Extract: API Output Demystified | AEM Community Blog Seeding

June 21, 2021

Adobe PDF Extract: API Output Demystified by Adobe Tech Blog

Abstract

Video: https://youtu.be/oIG6U_dDHII

Ever since PDF was invented, extracting text from any PDF file in the correct reading order, tabular data included, has been a challenge. Even the best tools on the market handled only one part of the problem well. Developers had to cobble together multiple tools depending on the type of PDF they had and the kind of data they needed from it. As a result, they needed to know far too much about a PDF file before they could choose a tool. A single API that extracts text into a usable form regardless of the PDF’s actual content simply didn’t exist.

One of the main reasons PDFs can be difficult for computers to read is that there are many poor PDF renderers and engines out there. While the PDFs these engines create may look right to a human reader, they are often not structured well enough for computers to understand them and determine the reading order. These engines work off the philosophy “if it looks right, who cares how the code of the PDF looks underneath?”

This has continued to be a challenge, until now.

Read Full Blog

Adobe PDF Extract: API Output Demystified

Q&A

Please use this thread to ask related questions.


