In our previous blogs, Iceberg At Adobe, Data ingestion with Buffered writes to Iceberg, and Optimized Reads With Iceberg, we covered the benefits of Apache Iceberg and how it fits into the overall Adobe Experience Platform architecture. In this blog, we share our story of migrating 1 PB+ datasets to Iceberg on the Adobe Experience Platform Data Lake, the challenges we faced, and the lessons we learned.
Adobe Experience Platform is an open system for driving real-time personalized experiences. Customers use it to centralize and standardize their data across the enterprise, resulting in a 360-degree view of their data of interest. That view can then be used with intelligent services to drive experiences across multiple devices, run targeted campaigns, classify profiles and other entities into segments, and leverage advanced analytics. At the center of our Data Lake architecture is the underlying storage. Data Lake relies on a Hadoop Distributed File System (HDFS) compatible backend for data storage, which today is Azure's cloud-based Azure Data Lake Storage Gen2 (ADLS).
Adobe Experience Platform Catalog Service provides a way of listing, searching, and provisioning a DataSet, which is our equivalent of a table in a relational database. It provides information such as a DataSet's name, description, schema, and permissions, along with all other metadata recorded on Adobe Experience Platform. As more data is ingested over time, it becomes increasingly difficult to query metadata from Catalog. With the introduction of Iceberg, we see a shift in how metadata is captured and recorded.