I am trying to convert a PDF to HTML. The idea is to use the PDF Extract API to extract layout and styling information, which it returns as JSON output. I am writing a simple Python program that parses the JSON and generates the corresponding HTML elements.
For example, a <Figure> JSON element can be converted to an <img> HTML element.
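Here is a minimal sketch of the mapping step I have in mind. The TAG_MAP table, the element_to_html helper, and the "Text" key for text elements are my own assumptions for illustration; only "Path" and "filePaths" appear in the output shown in this question.

```python
import html

# Hypothetical mapping from the last segment of an element's "Path"
# to an HTML tag; extend as needed.
TAG_MAP = {"Figure": "img", "P": "p", "H1": "h1"}


def element_to_html(element):
    """Return an HTML string for one extracted element, or None if unmapped."""
    # "//Document/Sect/Figure" -> "Figure"; also drop an index such as "P[2]".
    name = element["Path"].rsplit("/", 1)[-1].split("[", 1)[0]
    tag = TAG_MAP.get(name)
    if tag is None:
        return None
    if tag == "img":
        # Figures point at an exported image file.
        src = element["filePaths"][0]
        return '<img src="{}">'.format(html.escape(src, quote=True))
    # Assumption: text elements carry their content under a "Text" key.
    text = element.get("Text", "")
    return "<{0}>{1}</{0}>".format(tag, html.escape(text))
```

Calling element_to_html on the Figure element from this question would give an <img> tag whose src is "figures/fileoutpart1.png". The sketch does not position the element yet.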
The mapping from JSON tags to HTML tags is straightforward. What confuses me is that each element carries several bounding-box attributes. For instance:
"Bounds": [
87.047607421875,
2307.354721069336,
158.33660888671875,
2371.139617919922
],
"ClipBounds": [
87.047607421875,
2307.354721069336,
158.33660888671875,
2371.139617919922
],
"Page": 0,
"Path": "//Document/Sect/Figure",
"attributes": {
"BBox": [
200.45099999999366,
2591.719999999972,
271.7289999999921,
2655.5599999999395
],
"Placement": "Block"
},
"filePaths": [
"figures/fileoutpart1.png"
],
Which coordinates should be used to decide the placement of the corresponding <img> tag in the HTML: Bounds, ClipBounds, or BBox?
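For context, this is roughly how I plan to use whichever box is correct. It is only a sketch under assumptions I have not verified: that each box is ordered [left, bottom, right, top], that the origin is the bottom-left corner of the page, and that I can obtain the page height in the same units. The box_to_css helper and its page_height parameter are hypothetical names of mine.

```python
def box_to_css(element, key, page_height):
    """Build an absolute-positioning CSS string from one of the element's boxes.

    key is "Bounds", "ClipBounds", or "BBox"; which one is right is the
    open question here.
    """
    # "BBox" is nested under "attributes"; the other two are top-level keys.
    source = element["attributes"] if key == "BBox" else element
    # Assumption: [left, bottom, right, top] with a bottom-left origin.
    left, bottom, right, top = source[key]
    width = right - left
    height = top - bottom
    # HTML measures from the top of the page, so flip the vertical axis.
    css_top = page_height - top
    return (
        "position:absolute; left:{:.2f}px; top:{:.2f}px; "
        "width:{:.2f}px; height:{:.2f}px"
    ).format(left, css_top, width, height)
```

The returned string would go into the style attribute of the <img> tag. Since Bounds and BBox differ in the example above, the choice of key changes where the image ends up.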