One suggestion I would love to make to the Adobe team: could they offer data format options beyond CSV files? If the exports were in a modern format such as Parquet, our processing would also become much more efficient.
When we receive the Adobe files, the first thing we have to do is pre-process them to restructure the data, because of the way the files are formatted. This slows down the entire pipeline considerably, and it is also expensive.
The CSV files that are provided are not "splittable", meaning they cannot be divided up and processed in parallel by a big data tool like Spark or Databricks. The reason is that some fields contain embedded newline characters, and when that is the case a CSV file must be parsed serially; this is an inherent limitation of the CSV format's design. As a result, these huge files, which are multiple GB in size, must first be processed on a single machine instead of leveraging the parallelism of a cluster framework such as Spark. This limits our ability to process the data quickly.
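A minimal sketch of why embedded newlines break splitting (the data here is invented for illustration): a reader that splits the file on newline bytes, which is effectively what a parallel splitter does at byte offsets, lands in the middle of a record, while a real CSV parser has to scan quoted fields serially. This is also why Spark's CSV reader needs its `multiLine` option for such files, which forces a single-task read per file.

```python
import csv
import io

# A tiny CSV where the second record's "comment" field contains an
# embedded newline inside the quotes.
raw = 'id,comment\n1,"fine"\n2,"line one\nline two"\n'

# Naive newline splitting (what a byte-offset file split amounts to)
# sees 4 "lines" and would cut the last record in half.
naive_rows = raw.strip().split("\n")

# A proper CSV parser must read serially to honor the quoted field,
# and finds the true structure: a header plus 2 records.
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_rows))   # 4 lines from naive splitting
print(len(parsed_rows))  # 3 actual rows (header + 2 records)
```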
Switching to a format such as Parquet would mean the data could not only be processed in parallel, but would also be compressed automatically, saving on both storage costs and the compute costs of processing the files.