This blog is the third post in a series on Apache Iceberg at Adobe. In the first blog we gave an overview of the Adobe Experience Platform architecture. We showed how data flows through the Adobe Experience Platform, how the data's schema is laid out, and some of the unique challenges that it poses. We also discussed the basics of Apache Iceberg and what makes it a viable solution for our platform. Today, Iceberg is our de facto data format for all datasets in our data lake.
We covered ingestion throughput issues in the previous blog in this series. This post focuses on read performance with Apache Iceberg: we compare how Iceberg performed in our initial prototype with how it performs today, and walk through the optimizations we made to get it working for AEP.
Here are some of the challenges we faced, from a read perspective, before Iceberg:
1. Consistency Between Data & Metadata: Like most big data systems, we kept all our metadata in a store separate from our data lake. This made data restatement hard: restating thousands of files transactionally was difficult and error-prone, and readers were often left in an inconsistent state when data was compacted or archived aggressively underneath them.
2. Metadata Scalability: Our largest tables easily got bloated with metadata, making query planning cumbersome and hard to scale. Planning a query took many round trips to the metadata store and the data lake. Listing files on the data lake involved a recursive listing of hierarchical directories, which took hours for some datasets holding years of historical data.
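To illustrate the second challenge: Hive-style tables discover data files by recursively walking partition directories, so planning cost grows with the number of directories; Iceberg instead records every data file path in manifest files referenced from a table snapshot, so discovery is a flat read. The sketch below is purely illustrative (not Adobe's or Iceberg's actual code) and stands a plain text file in for an Iceberg manifest to contrast the two approaches.

```python
import os
import tempfile

def list_files_recursively(root: str) -> list[str]:
    """Hive-style discovery: walk every partition directory."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found

def read_manifest(manifest_path: str) -> list[str]:
    """Iceberg-style discovery: one read of a flat file list
    (real Iceberg manifests are Avro files with per-file stats)."""
    with open(manifest_path) as f:
        return [line.strip() for line in f if line.strip()]

# Build a tiny partitioned layout: year=.../month=.../part-0.parquet
root = tempfile.mkdtemp()
paths = []
for year in (2019, 2020):
    for month in range(1, 4):
        part_dir = os.path.join(root, f"year={year}", f"month={month:02d}")
        os.makedirs(part_dir)
        path = os.path.join(part_dir, "part-0.parquet")
        open(path, "w").close()  # empty placeholder data file
        paths.append(path)

# The manifest replaces O(directories) listing calls with a single read.
manifest = os.path.join(tempfile.mkdtemp(), "manifest.txt")
with open(manifest, "w") as f:
    f.write("\n".join(paths))

assert sorted(list_files_recursively(root)) == sorted(paths)
assert read_manifest(manifest) == paths
```

With years of daily partitions the walk touches thousands of directories, while the manifest read stays a constant number of I/O operations regardless of partition count; that difference is what makes snapshot-based planning scale.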