Hi @Yaswanth_ReddyBh ,
What software is being used for this conversion? Are you using the AEM Forms product? Or the Acrobat PDF Services APIs? Or automation written on top of Acrobat?
My guess is that you are using the Acrobat PDF Services APIs. You have posted this question on the AEM Forms forum, which is not the right place for it.
Regards,
Sufyan
If you want to convert your PDF to XML, you can use the ExportPDF option in AEM, which lets you convert any native PDF to XML format.
You can review the sample code at [0] to see how to convert the document to XML. Additionally, you can use the "exportPDF" method [1], to which you can pass the "ConvertPDFFormatType" argument [2] to get the desired XML output.
[0] https://experienceleague.adobe.com/en/docs/experience-manager-65/content/forms/developer-reference/p...
[1] https://developer.adobe.com/experience-manager/reference-materials/6-5/forms/programlc/javadoc/com/a....
[2] https://developer.adobe.com/experience-manager/reference-materials/6-5/forms/programlc/javadoc/com/a...
Thanks
Pranay
How was your PDF created?
Was it created from an XML document? If the PDF was created from a Word document and you export it as XML, the results will be unpredictable.
Ideally, the source format of the PDF and the export format should be the same to get the best results.
Hi,
Situation
I have an OCR'd PDF like this (please look at the attachments). What I'm currently doing is manually converting the OCR'd PDF to an XML file. The PDF has around 350 pages, and converting everything at once into XML does not keep the formatting, so I had to convert 2 pages at a time.
I tried the SDK, but it doesn't have a PDF-to-XML conversion option (at least not in what's described in the README file).
The manually converted XML is mostly fine, but there are some cases where it does not pick up the tables. For example, below, what should have been a table ended up as a subheading.
Needed output
The final output I want is in JSONL format. Converting the PDF to XML keeps the hierarchical structure, including the tables, and that is what I want.
What im currently doing
Currently I have converted my PDF to JSON, which keeps the structure but needs a lot of parsing.
I'm working on building a vector database for my RAG model, and having perfectly formatted data is key. That's the reason I'm going to extreme lengths to keep the formatting of the tables.
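For anyone with the same JSON-to-JSONL parsing problem, here is a minimal Python sketch of one way it might be approached. It is not from the original post and rests on assumptions: that the JSON is a flat list of elements, each with a `Path` field (such as `//Document/H1` or `//Document/Table/TR/TD/P`) and a `Text` field. Those field names, and the helper names `to_records` and `to_jsonl`, are hypothetical and would need adjusting to whatever the conversion tool actually produces. The sketch groups text under the most recent heading and keeps table cell text separate from ordinary paragraphs.

```python
import json
import re

# Assumption: heading tags look like "H1", "H2", "H3[2]", etc.
HEADING_TAG = re.compile(r"^H\d")


def _new_section(heading):
    return {"heading": heading, "paragraphs": [], "table_cells": []}


def to_records(elements):
    """Group a flat list of elements into one record per heading.

    Each element is assumed to be a dict with "Path" and "Text" keys
    (hypothetical names; adapt them to the real JSON).
    """
    records = []
    current = _new_section(None)
    for el in elements:
        text = (el.get("Text") or "").strip()
        if not text:
            continue  # skip elements that carry no text
        path = el.get("Path", "")
        tag = path.rsplit("/", 1)[-1]  # last path segment, e.g. "H1" or "P[2]"
        if HEADING_TAG.match(tag):
            # A new heading closes the previous section if it has any content.
            if current["heading"] or current["paragraphs"] or current["table_cells"]:
                records.append(current)
            current = _new_section(text)
        elif "/Table" in path:
            current["table_cells"].append(text)
        else:
            current["paragraphs"].append(text)
    if current["heading"] or current["paragraphs"] or current["table_cells"]:
        records.append(current)
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    sample = [
        {"Path": "//Document/H1", "Text": "Introduction"},
        {"Path": "//Document/P", "Text": "Some body text."},
        {"Path": "//Document/Table/TR/TD/P", "Text": "Cell A1"},
    ]
    print(to_jsonl(to_records(sample)))
```

Keeping each section as one JSONL line means a heading, its paragraphs, and its table cells stay together when the lines are later chunked for a vector database. If rows and columns matter, the table cells would need to be grouped by their row segment in the path rather than collected into a flat list.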
@Yaswanth_ReddyBh Just checking in — were you able to resolve your issue?
We’d love to hear how things worked out. If the suggestions above helped, marking a response as correct can guide others with similar questions. And if you found another solution, feel free to share it — your insights could really benefit the community. Thanks again for being part of the conversation!