SOLVED

PDF to XML conversion using adobe acrobat

Level 1
Hi guys, I want to convert my PDF to XML format using the SDK. I tried doing it directly from the app, but after a few pages the structure gets messed up: every subheading ends up under a single header. So I had to split the PDF into multiple small PDFs and convert each one manually through the app, which is eating up my time. I wanted to ask: 1) Am I doing something wrong with the software? 2) Is there an option to convert PDF to XML with the SDK? If yes, can anyone please provide the code?
1 Accepted Solution

Correct answer by
Employee

Hi @Yaswanth_ReddyBh , 

 

What is the software being used for this conversion? Are you using AEM Forms product for this conversion? Or are you using Acrobat PDF Services APIs? Or automation written on top of Acrobat? 

 

My guess is you are using Acrobat PDF Services APIs. You have posted this question on AEM Forms forums which is not the right place for this question. 

 

Regards, 
Sufyan

5 Replies

Employee

Hi @Yaswanth_ReddyBh,

If you want to convert your PDF to XML, you can use the ExportPDF option in AEM, which lets you convert any native PDF to XML format.

You can review the sample code at [0] to see how to convert the document to XML. Additionally, you can use the "exportPDF" method [1], to which you pass the "ConvertPDFFormatType" argument [2] to get the desired XML output.

 

[0] https://experienceleague.adobe.com/en/docs/experience-manager-65/content/forms/developer-reference/p...

[1] https://developer.adobe.com/experience-manager/reference-materials/6-5/forms/programlc/javadoc/com/a....

[2] https://developer.adobe.com/experience-manager/reference-materials/6-5/forms/programlc/javadoc/com/a...

Thanks
Pranay

Employee Advisor

How was your PDF created?

Was it created from an XML document? If the PDF was created from a Word document and you export it as XML, the results will be unpredictable.

Ideally, the source format of the PDF and the export format should be the same to get the best results.

Level 1

Hi,

Situation
I have an OCR'd PDF like this (please look at the attachments). What I'm currently doing is manually converting the OCR'd PDF to an XML file. The PDF has around 350 pages, and converting everything at once into XML does not keep the formatting, so I have had to convert 2 pages at a time.

Yaswanth_ReddyBh_0-1750907539237.png
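The manual splitting described above could probably be scripted. Below is a minimal sketch, assuming the third-party `pypdf` package is installed (it is not part of any Adobe SDK and is not mentioned elsewhere in this thread). The function names `page_ranges` and `split_pdf` and the output file naming are hypothetical. It only produces the small PDFs; the conversion to XML would still have to be done separately.

```python
from pathlib import Path


def page_ranges(total_pages, chunk_size):
    """Return (start, end) page index pairs, 0-based, with end exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src, out_dir, chunk_size=2):
    """Write one small PDF per page range and return the output paths."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for start, end in page_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: 1-based page numbers in the file name.
        target = out / f"pages_{start + 1:04d}-{end:04d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        paths.append(target)
    return paths
```

For a 350-page file with `chunk_size=2`, `page_ranges` yields 175 ranges, so `split_pdf("manual.pdf", "chunks")` would write 175 two-page files (the input file name here is only a placeholder).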

 


I tried the SDK, but it doesn't have a PDF to XML conversion option (at least not in what is covered in the readme file).

The manually converted XML is mostly fine, but there are some cases where it does not pick up the tables. In the example below, what should have been a table ended up as a subheading.

 

Yaswanth_ReddyBh_2-1750907765890.png

 

 

Needed output

The final output I want is in JSONL form. Converting PDF to XML keeps the hierarchical structure, including the tables, and that's what I want.

Yaswanth_ReddyBh_1-1750907604427.png

 


What I'm currently doing

Currently I have converted my PDF to JSON, which keeps the structure but needs a lot of parsing.

I'm working on making a vector database for my RAG model, and having perfectly formatted data is key. That's the reason I'm going to extreme lengths to keep the formatting of the tables.
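For the JSON to JSONL step, here is a minimal sketch of the kind of parsing involved. The shape of the input JSON is an assumption: the thread does not show the actual export, so the keys `title`, `text`, `table`, and `children` are hypothetical and would need to be mapped to whatever the real file contains. The idea is to walk the hierarchy once and emit one record per text block or table, each carrying its full heading path so the structure survives in a flat file.

```python
import json


def flatten(node, path=()):
    """Yield one record per text block or table, tagged with its heading path."""
    # Assumption: each node is a dict with optional "title", "text",
    # "table" (a list of rows) and "children" keys. These names are
    # hypothetical; adapt them to the real JSON export.
    title = node.get("title")
    here = path + (title,) if title else path

    if "text" in node:
        yield {"path": list(here), "type": "text", "content": node["text"]}
    if "table" in node:
        yield {"path": list(here), "type": "table", "content": node["table"]}

    for child in node.get("children", []):
        yield from flatten(child, here)


def to_jsonl(root):
    """Serialize the flattened records as one JSON object per line."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False) for record in flatten(root)
    )
```

Keeping tables as lists of rows inside a single record (rather than flattening them to text) means each table stays together as one chunk when it is later embedded for retrieval.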

Administrator

@Yaswanth_ReddyBh Just checking in — were you able to resolve your issue?
We’d love to hear how things worked out. If the suggestions above helped, marking a response as correct can guide others with similar questions. And if you found another solution, feel free to share it — your insights could really benefit the community. Thanks again for being part of the conversation!



Kautuk Sahni