Disassembling a PDF based on a text string on the page

Former Community Member

I am looking for guidance (and an example, if possible) on how to disassemble a multipage PDF based on text such as "Tax ID" that appears on certain pages. For example, I have a 1000-page document in which 100 pages each contain the text "Tax ID" for a different person, and I would like 100 separate PDFs as output, each consisting of the one page that carries that person's "Tax ID". It would also be great to extract the value next to the text "Tax ID" so that the PDFs can be named accordingly.

The challenge is getting the page numbers of the pages that contain the "Tax ID" text, along with the text sitting to the right of it. Once I have that, I can simply feed the information back into Assembler via the DDX for extraction.

Any help here would be greatly appreciated.

1 Reply

Level 2

You've posed an interesting problem. Here is one approach, which adds a few steps to your Workbench process.

  1. Invoke the Assembler service with a DDX that extracts text information from the original PDF.
  2. Invoke the XSLT service to convert the extracted text info into a Bookmark file.
  3. Invoke the Assembler service with a two-part DDX that imports the Bookmark file into the original PDF and then uses the added bookmarks to disassemble the PDF.

Invoke the Assembler service with a DDX that extracts text information from the original PDF

Here is a DDX that extracts text info:

<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
  <DocumentText result="text">
    <PDF source="myOriginalPDF"/>
  </DocumentText>
</DDX>

The result is an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="C:\Adobe\TaxID.xslt"?>
<DocText xmlns="http://ns.adobe.com/DDX/DocText/1.0/">
    <TextPerPage>
        <Page pageNumber="1">to market to market</Page>
        <Page pageNumber="1">TAX ID 1111 Gee I owe a lot of money to the IRS . How could this be ?</Page>
        <Page pageNumber="2">TAX ID 2222 We all owe lots of money</Page>
        <Page pageNumber="3">TAX ID 3333 We all owe lots of money</Page>
    </TextPerPage>
</DocText>

Invoke the XSLT service to convert the extracted text info into a Bookmark file

Here is an XSLT that converts the text info into a Bookmark file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:textInfo="http://ns.adobe.com/DDX/DocText/1.0/">
    <xsl:output method="xml" version="1.0" encoding="UTF-8"/>
    <xsl:template match="/">
        <Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
            <xsl:apply-templates/>
        </Bookmarks>
    </xsl:template>

    <!-- Emit one bookmark for each page whose text contains "TAX ID" -->
    <xsl:template match="textInfo:Page">
        <xsl:variable name="myText" select="text()"/>
        <xsl:if test='contains($myText, "TAX ID")'>
            <!-- Assumes the page text starts with "TAX ID " followed by a four-character ID -->
            <xsl:variable name="taxID" select='substring($myText, 8, 4)'/>
            <Bookmark>
                <Dest>
                    <Fit>
                        <xsl:attribute name="PageNum">
                            <xsl:value-of select="@pageNumber"/>
                        </xsl:attribute>
                    </Fit>
                </Dest>
                <Title>
                    <xsl:value-of select="$taxID"/>
                </Title>
            </Bookmark>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Here is the result of this XSLT applied against the example text info:

<?xml version="1.0" encoding="UTF-8"?>
<Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
    <Bookmark xmlns="">
        <Dest>
            <Fit PageNum="1"/>
        </Dest>
        <Title>1111</Title>
    </Bookmark>
    <Bookmark xmlns="">
        <Dest>
            <Fit PageNum="2"/>
        </Dest>
        <Title>2222</Title>
    </Bookmark>
    <Bookmark xmlns="">
        <Dest>
            <Fit PageNum="3"/>
        </Dest>
        <Title>3333</Title>
    </Bookmark>
</Bookmarks>

If you use this XSLT, you should refine it to match the string "TAX ID" only at the beginning of the page text rather than anywhere on the page. You should also make the extraction of the Tax ID value independent of its length.
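
As a rough sketch of those two refinements, the Page template could be replaced with something like the following. It uses only XPath 1.0 string functions and the same textInfo prefix as the stylesheet above. It assumes the ID is the first whitespace-delimited token after "TAX ID " and contains no spaces; the variable name rest is introduced here only for illustration.

    <xsl:template match="textInfo:Page">
        <!-- Collapse runs of whitespace and trim both ends of the page text -->
        <xsl:variable name="myText" select="normalize-space(.)"/>
        <!-- Match only pages whose text begins with the marker -->
        <xsl:if test="starts-with($myText, 'TAX ID ')">
            <xsl:variable name="rest" select="substring-after($myText, 'TAX ID ')"/>
            <!-- Take everything up to the next space; the appended space
                 covers the case where the ID is the last token on the page -->
            <xsl:variable name="taxID" select="substring-before(concat($rest, ' '), ' ')"/>
            <Bookmark>
                <Dest>
                    <Fit PageNum="{@pageNumber}"/>
                </Dest>
                <Title>
                    <xsl:value-of select="$taxID"/>
                </Title>
            </Bookmark>
        </xsl:if>
    </xsl:template>

If the real documents format the ID differently (embedded spaces, a colon after "TAX ID", and so on), the extraction expressions would need to be adjusted to match.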

Invoke the Assembler with a two-part DDX

Write a DDX that imports the Bookmark file into the original PDF and then uses the added bookmarks to disassemble the PDF.
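
A minimal sketch of such a DDX might look like the following. The input names myOriginalPDF and myBookmarks, the intermediate result name bookmarkedPDF, and the prefix value are assumptions made for illustration; they must match the entries in the input map you pass to the Assembler service, and the element usage should be checked against the DDX reference for your LiveCycle version.

    <DDX xmlns="http://ns.adobe.com/DDX/1.0/">
      <!-- Part 1: import the generated Bookmark file into the original PDF -->
      <PDF result="bookmarkedPDF">
        <PDF source="myOriginalPDF"/>
        <Bookmarks source="myBookmarks"/>
      </PDF>
      <!-- Part 2: split the bookmarked PDF into one document per bookmark -->
      <PDFsFromBookmarks prefix="TaxID">
        <PDF source="bookmarkedPDF"/>
      </PDFsFromBookmarks>
    </DDX>

Because each bookmark title holds the extracted Tax ID value, the names of the resulting documents should include that value, which addresses the naming requirement in the original question.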