PDF to XML conversion using Adobe Acrobat

Question

Hi guys, I want to convert my PDF to XML using the SDK. I tried doing it directly from the app, but after a few pages the structure gets messed up: every subheading ends up under a single header. So I had to split the PDF into multiple small PDFs and convert each one manually through the app, which is eating up my time. I wanted to ask:

1) Am I doing something wrong with the software?
2) Is there an option to convert PDF to XML with the SDK? If yes, can anyone please provide the code?

sharoon23 · Accepted Answer

Hi,

Situation:
I have an OCR'd PDF (please see the attachments). What I'm currently doing is manually converting the OCR'd PDF to an XML file. The PDF has around 350 pages, and converting everything at once into XML does not keep the format, so I had to convert 2 pages at a time. I tried the SDK, but it doesn't have a PDF to XML conversion option (at least not in what's described in the readme file). The manually converted XML is mostly fine, but there are some cases where it doesn't pick up the tables: content that should have been a table ended up as a subheading.

Needed output:
The final output I want is JSONL. Converting PDF to XML keeps the hierarchical structure, including the tables, and that's what I want.

What I'm currently doing:
Currently I convert my PDF to JSON, which keeps the structure but needs a lot of parsing. I'm working on building a vector database for my RAG model, and having perfectly formatted data is key. That's the reason I'm going to extreme lengths to keep the formatting of the tables.

Hi @yaswanth_reddybh,

What software is being used for this conversion? Are you using the AEM Forms product? Or the Acrobat PDF Services APIs? Or automation written on top of Acrobat?

My guess is that you are using the Acrobat PDF Services APIs. You have posted this question on the AEM Forms forum, which is not the right place for it.

Regards,
Sufyan
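For the XML to JSONL step described in the follow-up above, here is a minimal Python sketch using only the standard library. It walks an exported XML tree, tracks the current heading trail, and emits one record per paragraph or table so that tables stay grouped under their headings. The tag names (`H1` to `H6`, `P`, `Table`, `TR`, `TH`, `TD`) and the record fields (`headings`, `type`, `text`, `rows`) are assumptions for illustration, not something confirmed in this thread; check them against what your export actually produces. It also assumes the XML uses no namespaces.

```python
import json
import xml.etree.ElementTree as ET

# Assumed tag names; adjust these to match the tags in your exported XML.
HEADING_TAGS = {"H1": 1, "H2": 2, "H3": 3, "H4": 4, "H5": 5, "H6": 6}
TABLE_TAG = "Table"
ROW_TAG = "TR"
CELL_TAGS = {"TH", "TD"}
TEXT_TAGS = {"P"}


def clean_text(elem):
    """Return all text inside elem with runs of whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def xml_to_records(xml_string):
    """Flatten the XML into records that each carry their heading trail."""
    root = ET.fromstring(xml_string)
    path = []      # current heading trail, e.g. ["Chapter 1", "1.1 Scope"]
    records = []

    def walk(elem):
        if elem.tag in HEADING_TAGS:
            # A heading at level N replaces everything at level N and deeper.
            del path[HEADING_TAGS[elem.tag] - 1:]
            path.append(clean_text(elem))
        elif elem.tag == TABLE_TAG:
            rows = [
                [clean_text(cell) for cell in row if cell.tag in CELL_TAGS]
                for row in elem.iter(ROW_TAG)
            ]
            records.append({"headings": list(path), "type": "table", "rows": rows})
        elif elem.tag in TEXT_TAGS:
            text = clean_text(elem)
            if text:
                records.append({"headings": list(path), "type": "text", "text": text})
        else:
            # Container element (section, document root, ...): descend into it.
            for child in elem:
                walk(child)

    walk(root)
    return records


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical sample input, made up for this sketch.
sample = """<Document>
  <Sect>
    <H1>Specifications</H1>
    <P>Overview of the   unit.</P>
    <Sect>
      <H2>Dimensions</H2>
      <Table>
        <TR><TH>Part</TH><TH>Width</TH></TR>
        <TR><TD>A</TD><TD>10 mm</TD></TR>
      </Table>
    </Sect>
  </Sect>
</Document>"""

if __name__ == "__main__":
    print(to_jsonl(xml_to_records(sample)))
```

Keeping the heading trail on every record means each JSONL line can be embedded on its own without losing its place in the document, and tables remain a list of rows rather than flattened text. If the converter mislabels a table as a subheading, as described above, this sketch will not recover it; it only preserves whatever structure the XML already has.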

Pranay_M · Answer

Hi @yaswanth_reddybh,

If you want to convert your PDF to XML, you can use the ExportPDF option in AEM, which allows you to convert any native PDF to XML format. You can review the sample code at [0] to see how to convert the document to XML. Additionally, you can use the "exportPDF" method [1], to which you can pass the "ConvertPDFFormatType" argument [2] to get the desired XML output.

[0] https://experienceleague.adobe.com/en/docs/experience-manager-65/content/forms/developer-reference/programming-aem-forms-jee/java-api-quick-start-code-examples/generate-pdf-service-java-api#quick-start-soap-mode-converting-a-pdf-document-to-an-rtf-file-using-the-java-api-soap-mode
[1] https://developer.adobe.com/experience-manager/reference-materials/6-5/forms/programlc/javadoc/com/adobe/livecycle/generatepdf/client/GeneratePdfServiceClient.html#:~:text=Class%20GeneratePdfServiceClient&text=Provides%20methods%20for%20creating%20PDF,PDF%20in%20Programming%20with%20Forms.
[2] https://developer.adobe.com/experience-manager/reference-materials/6-5/forms/programlc/javadoc/com/adobe/livecycle/generatepdf/client/ConvertPDFFormatType.html

Thanks,
Pranay
