You've posed an interesting problem. Here is one approach, which requires adding a few steps to your Workbench process:
- Invoke the Assembler service with a DDX that extracts text information from the original PDF.
- Invoke the XSLT service to convert the extracted text info into a Bookmark file.
- Invoke the Assembler with a two-part DDX that imports the Bookmark file into the original PDF and then uses the added bookmarks to disassemble the PDF.
Invoke the Assembler service with a DDX that extracts text information from the original PDF
Here is a DDX that extracts text info:
<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
  <DocumentText result="text">
    <PDF source="myOriginalPDF"/>
  </DocumentText>
</DDX>
The result is an XML file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="C:\Adobe\TaxID.xslt"?>
<DocText xmlns="http://ns.adobe.com/DDX/DocText/1.0/">
<TextPerPage>
<Page pageNumber="1">to market to market</Page>
<Page pageNumber="1">TAX ID 1111 Gee I owe a lot of money to the IRS . How could this be ?</Page>
<Page pageNumber="2">TAX ID 2222 We all owe lots of money</Page>
<Page pageNumber="3">TAX ID 3333 We all owe lots of money</Page>
</TextPerPage>
</DocText>
Invoke the XSLT service to convert the extracted text info into a Bookmark file
Here is an XSLT that converts the text info into a Bookmark file:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:textInfo="http://ns.adobe.com/DDX/DocText/1.0/">
  <xsl:output method="xml" version="1.0" encoding="UTF-8"/>
  <xsl:template match="/">
    <Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
      <xsl:apply-templates/>
    </Bookmarks>
  </xsl:template>
  <xsl:template match="textInfo:Page">
    <xsl:variable name="myText" select="text()"/>
    <xsl:if test='contains( $myText, "TAX ID")'>
      <xsl:variable name="taxID"
        select='substring($myText, 8, 4)'/>
      <Bookmark>
        <Dest>
          <Fit>
            <xsl:attribute name="PageNum">
              <xsl:value-of select="@pageNumber"/>
            </xsl:attribute>
          </Fit>
        </Dest>
        <Title>
          <xsl:value-of select="$taxID"/>
        </Title>
      </Bookmark>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
Here is the result of this XSLT applied against the example text info:
<?xml version="1.0" encoding="UTF-8"?>
<Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
  <Bookmark xmlns="">
    <Dest>
      <Fit PageNum="1"/>
    </Dest>
    <Title>1111</Title>
  </Bookmark>
  <Bookmark xmlns="">
    <Dest>
      <Fit PageNum="2"/>
    </Dest>
    <Title>2222</Title>
  </Bookmark>
  <Bookmark xmlns="">
    <Dest>
      <Fit PageNum="3"/>
    </Dest>
    <Title>3333</Title>
  </Bookmark>
</Bookmarks>
If you use this XSLT, refine it in two ways. First, look for the string "TAX ID" only at the beginning of the page text rather than anywhere on the page. Second, extract the tax ID number in a way that does not depend on its length; the substring call above assumes exactly four characters starting at position 8.
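As a rough sketch of those two refinements, the stylesheet below changes only the Page template. It assumes the ID is the first whitespace-delimited token after "TAX ID " and that the marker is always followed by a space; neither assumption comes from your actual documents, so adjust the tests to match your data. The use of normalize-space, starts-with, substring-after and substring-before is my choice of standard XPath 1.0 functions, not something the service requires.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:textInfo="http://ns.adobe.com/DDX/DocText/1.0/">
  <xsl:output method="xml" version="1.0" encoding="UTF-8"/>
  <xsl:template match="/">
    <Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
      <xsl:apply-templates/>
    </Bookmarks>
  </xsl:template>
  <xsl:template match="textInfo:Page">
    <!-- Collapse whitespace so leading spaces or line breaks do not defeat the test -->
    <xsl:variable name="myText" select="normalize-space(.)"/>
    <!-- Assumption: a page of interest BEGINS with the marker "TAX ID " -->
    <xsl:if test="starts-with($myText, 'TAX ID ')">
      <!-- Assumption: the ID is everything after the marker up to the next space.
           The appended space covers a page whose text ends right after the ID. -->
      <xsl:variable name="taxID"
        select="substring-before(concat(substring-after($myText, 'TAX ID '), ' '), ' ')"/>
      <Bookmark>
        <Dest>
          <Fit PageNum="{@pageNumber}"/>
        </Dest>
        <Title>
          <xsl:value-of select="$taxID"/>
        </Title>
      </Bookmark>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the example text info above, this should produce the same three bookmarks (1111, 2222, 3333), and it would also pick up IDs that are shorter or longer than four characters.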
Invoke the Assembler with a two-part DDX
Write a DDX that imports the Bookmark file into the original PDF and then uses the added bookmarks to disassemble the PDF.
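A sketch of what such a two-part DDX might look like follows. It relies on the Bookmarks and PDFsFromBookmarks elements as I remember them from the DDX Reference, and it assumes the second part can consume the result named in the first part; please check both points against the reference for your LiveCycle version. The names myBookmarks, withBookmarks and the prefix taxID are placeholders of mine: myBookmarks stands for the input map key under which you pass the XSLT output.

```xml
<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
  <!-- Part 1: import the generated Bookmark file into the original PDF.
       "withBookmarks" is a hypothetical intermediate result name. -->
  <PDF result="withBookmarks">
    <PDF source="myOriginalPDF">
      <Bookmarks source="myBookmarks"/>
    </PDF>
  </PDF>
  <!-- Part 2: split the bookmarked PDF into one document per bookmark.
       "taxID" is a hypothetical prefix for the names of the resulting PDFs. -->
  <PDFsFromBookmarks prefix="taxID">
    <PDF source="withBookmarks"/>
  </PDFsFromBookmarks>
</DDX>
```

Each resulting document should start at a bookmarked page and run up to the page before the next bookmark, so any pages before the first "TAX ID" page may need separate handling.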