Taking Query Optimizations to the Next Level with Iceberg | AEM Community Blog Seeding

Avatar

Avatar
Establish
Community Manager
kautuk_sahni
Community Manager

Likes

1,198 likes

Total Posts

6,375 posts

Correct reply

1,147 solutions
Top badges earned
Establish
Coach
Originator
Contributor 2
Contributor
View profile

Avatar
Establish
Community Manager
kautuk_sahni
Community Manager

Likes

1,198 likes

Total Posts

6,375 posts

Correct reply

1,147 solutions
Top badges earned
Establish
Coach
Originator
Contributor 2
Contributor
View profile
kautuk_sahni
Community Manager

15-01-2021

BlogImage.jpg

Taking Query Optimizations to the Next Level with Iceberg by Gautam P. Kowshik and Xabriel J. Collazo Mojica

Abstract

This blog is the third post of a series on Apache Iceberg at Adobe. In the first blog we gave an overview of the Adobe Experience Platform architecture. We showed how data flows through the Adobe Experience Platform, how the data’s schema is laid out, and also some of the unique challenges that it poses. We also discussed the basics of Apache Iceberg and what makes it a viable solution for our platform. Iceberg today is our de-facto data format for all datasets in our data lake.

We covered issues with ingestion throughput in the previous blog in this series. We will now focus on achieving read performance using Apache Iceberg and compare how Iceberg performed in the initial prototype vs. how it does today and walk through the optimizations we did to make it work for AEP.

Here are some of the challenges we faced, from a read perspective, before Iceberg:
1. Consistency Between Data & Metadata: We had all our metadata in a separate store from our data lake as most big data systems do. This made it hard to do data restatement. Restating thousands of files in a transactional way was difficult and error-prone. Readers were often left in a potentially inconsistent state due to data being compacted or archived aggressively.

2 Metadata Scalability: Our largest tables would easily get bloated with metadata causing query planning to be cumbersome and hard to scale. Task effort planning took many trips to the metadata store and data lake. Listing files on data lake involve a recursive listing of hierarchical directories that took hours for some datasets that had years of historical data.

Read Full Blog

Taking Query Optimizations to the Next Level with Iceberg

Q&A

Please use this thread to ask the related questions.