Ever since PDF was invented, extracting text from an arbitrary PDF file in the correct reading order, tabular data included, has been a challenge. Even the best tools on the market were good at only one part of the problem or the other. Developers had to cobble together multiple tools depending on the type of PDF they had and the kind of data they needed from it. As a result, they had to know far too much about a PDF file before they could even choose a tool. A single API that extracted text in a usable form, regardless of the PDF’s actual content, simply didn’t exist.
One of the main reasons PDF can be difficult for computers to read is that there are many poor PDF renderers and engines out there. The PDFs these engines produce may look fine to a human reader, but they are often not structured in a way that lets software determine the reading order and meaning of the text. These engines work from the philosophy “if it looks right, who cares how the code of the PDF looks underneath?”
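To make the reading-order problem concrete, here is a minimal sketch in Python. It does not parse a real PDF and does not use any particular library; the `Fragment` type, the sample data, and the `line_tolerance` parameter are all hypothetical, chosen only to illustrate how text that is positioned correctly on the page can still be stored in an order that makes no sense to software.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A hypothetical positioned piece of text, as a PDF might store it."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF origin is bottom-left, so larger y is higher)


# Made-up fragments in the order a careless generator might write them.
# On the page they look fine, because each one carries its own position.
fragments = [
    Fragment("world", 60.0, 700.0),
    Fragment("Second line", 10.0, 680.0),
    Fragment("Hello", 10.0, 700.0),
]


def stream_order(frags):
    """Naive extraction: read the fragments in the order they are stored."""
    return " ".join(f.text for f in frags)


def reading_order(frags, line_tolerance=2.0):
    """Group fragments into lines by baseline, then read each line left to right."""
    lines = []
    for frag in sorted(frags, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)  # same baseline: same line
        else:
            lines.append([frag])  # new line
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    )


print(stream_order(fragments))   # world Second line Hello
print(reading_order(fragments))  # Hello world / Second line (on two lines)
```

Sorting by position is enough for this toy page, but it is only an illustration of the idea. Real documents with multiple columns, rotated text, or tables need considerably more than a simple top-to-bottom, left-to-right sort.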
Reliable text extraction in the correct reading order has remained a challenge, until now.