I am looking for guidance (and an example, if possible) on how to disassemble a multipage PDF based on text such as "Tax ID" that appears on certain pages. For example, in a 1000-page document, 100 pages may contain the text "Tax ID", each for a different person, and I would like 100 separate PDFs as output, each containing the one page with that person's "Tax ID". In addition, it would be great to extract the value next to the text "Tax ID" so that the PDFs could be named accordingly.
The challenge is how to get the page numbers that contain the "Tax ID" text, along with the text sitting to the right of it. Once I have that, I can simply feed the information back into Assembler via the DDX for extraction.
Any help here would be greatly appreciated.
You've posed an interesting problem. Here is one approach, which requires adding a few steps to your Workbench process.
Here is a DDX that extracts text info:
<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
  <DocumentText result="text">
    <PDF source="myOriginalPDF"/>
  </DocumentText>
</DDX>
The result will be an XML file with this appearance:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="C:\Adobe\TaxID.xslt"?>
<DocText xmlns="http://ns.adobe.com/DDX/DocText/1.0/">
  <TextPerPage>
    <Page pageNumber="1">to market to market</Page>
    <Page pageNumber="1">TAX ID 1111 Gee I owe a lot of money to the IRS . How could this be ?</Page>
    <Page pageNumber="2">TAX ID 2222 We all owe lots of money</Page>
    <Page pageNumber="3">TAX ID 3333 We all owe lots of money</Page>
  </TextPerPage>
</DocText>
Here is an XSLT that converts the text info into a Bookmark file:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:textInfo="http://ns.adobe.com/DDX/DocText/1.0/">
  <xsl:output method="xml" version="1.0" encoding="UTF-8"/>
  <!-- Wrap all generated bookmarks in a single Bookmarks root element -->
  <xsl:template match="/">
    <Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
      <xsl:apply-templates/>
    </Bookmarks>
  </xsl:template>
  <!-- Emit one bookmark for each page whose text contains "TAX ID" -->
  <xsl:template match="textInfo:Page">
    <xsl:variable name="myText" select="text()"/>
    <xsl:if test='contains( $myText, "TAX ID")'>
      <!-- Assumes the page starts with "TAX ID " followed by a 4-character ID -->
      <xsl:variable name="taxID" select='substring($myText, 8, 4)'/>
      <Bookmark>
        <Dest>
          <Fit>
            <xsl:attribute name="PageNum">
              <xsl:value-of select="@pageNumber"/>
            </xsl:attribute>
          </Fit>
        </Dest>
        <Title>
          <xsl:value-of select="$taxID"/>
        </Title>
      </Bookmark>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
Here is the result of this XSLT applied against the example text info:
<?xml version="1.0" encoding="UTF-8"?>
<Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
  <Bookmark xmlns="">
    <Dest>
      <Fit PageNum="1"/>
    </Dest>
    <Title>1111</Title>
  </Bookmark>
  <Bookmark xmlns="">
    <Dest>
      <Fit PageNum="2"/>
    </Dest>
    <Title>2222</Title>
  </Bookmark>
  <Bookmark xmlns="">
    <Dest>
      <Fit PageNum="3"/>
    </Dest>
    <Title>3333</Title>
  </Bookmark>
</Bookmarks>
If you use this XSLT, you should refine it to search for the string "TAX ID" at the beginning of the page rather than anywhere in the page. You should also improve the identification of the TAX ID number to be independent of the length.
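As a rough sketch of those two refinements, the stylesheet below uses only standard XPath 1.0 string functions (normalize-space, starts-with, substring-after, substring-before). It assumes the label is exactly the uppercase string "TAX ID" at the start of the page text and that the ID is the single whitespace-delimited token that follows it; both are assumptions about your documents, so adjust them to match the real layout.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:textInfo="http://ns.adobe.com/DDX/DocText/1.0/">
  <xsl:output method="xml" version="1.0" encoding="UTF-8"/>

  <xsl:template match="/">
    <Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
      <xsl:apply-templates/>
    </Bookmarks>
  </xsl:template>

  <xsl:template match="textInfo:Page">
    <!-- Trim the page text and collapse runs of whitespace to single spaces -->
    <xsl:variable name="myText" select="normalize-space(.)"/>
    <!-- Only pages that BEGIN with the label qualify (assumed label: "TAX ID") -->
    <xsl:if test="starts-with($myText, 'TAX ID ')">
      <!-- Text after the label; the appended space guarantees a delimiter
           even when the ID is the last token on the page -->
      <xsl:variable name="rest"
          select="concat(substring-after($myText, 'TAX ID '), ' ')"/>
      <!-- The ID is everything up to the first space, whatever its length -->
      <xsl:variable name="taxID" select="substring-before($rest, ' ')"/>
      <Bookmark>
        <Dest>
          <Fit PageNum="{@pageNumber}"/>
        </Dest>
        <Title>
          <xsl:value-of select="$taxID"/>
        </Title>
      </Bookmark>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

The match is case-sensitive, so a page that reads "Tax ID" rather than "TAX ID" would be skipped unless you change the label in both places where it appears.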
Finally, write a DDX that imports the Bookmark file into the original PDF and then uses the added bookmarks to disassemble the PDF.
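A minimal sketch of what that DDX might look like is shown here. PDFsFromBookmarks, PDF source, and Bookmarks source are DDX elements, but the nesting below, the input name "myBookmarks" (the XSLT output mapped as an Assembler input), and the prefix "TaxID" are assumptions; check them against the DDX reference for your LiveCycle version before relying on this.

```xml
<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
  <!-- Split the document into one PDF per level-1 bookmark.
       "TaxID" is a hypothetical prefix for the output names. -->
  <PDFsFromBookmarks prefix="TaxID">
    <PDF source="myOriginalPDF">
      <!-- Assumed nesting: apply the generated bookmark file to the source.
           "myBookmarks" is a hypothetical input map key. -->
      <Bookmarks source="myBookmarks"/>
    </PDF>
  </PDFsFromBookmarks>
</DDX>
```

If your version does not accept the bookmarks nested this way, the same idea can be split into two Assembler invocations: one that writes a bookmarked copy of the PDF, and a second that runs PDFsFromBookmarks on that copy.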