Dataset Activation with Data Distiller in Adobe Experience Platform

Employee

10/15/24

In today's data-driven world, efficiently activating datasets is crucial for maximizing business value. Whether it's AI/ML model training, enterprise reporting, or providing a 360-degree view of your customer, Data Distiller in Adobe Experience Platform (AEP) plays a pivotal role in transforming raw data into structured, actionable insights.

Data Distiller allows you to convert raw datasets into derived datasets that are pre-processed, enriched, and ready for immediate use. This process reduces complexity and significantly enhances the performance of data analysis and model training. By structuring data into a star schema, which includes both fact tables (like sales, revenue) and lookup tables (such as customer demographics), businesses can seamlessly leverage pre-aggregated, optimized data for real-time insights.
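As a rough illustration of the star-schema idea above, the sketch below joins a fact table to a lookup table and materializes a pre-aggregated derived table. It uses Python's built-in SQLite purely as a local stand-in; it is not Data Distiller syntax, and the table and column names (sales_fact, customer_lookup, revenue_by_segment) are hypothetical.

```python
import sqlite3


def build_derived_dataset(conn):
    """Join a fact table to a lookup table and pre-aggregate revenue per segment."""
    cur = conn.cursor()
    cur.executescript(
        """
        -- Fact table: one row per sale (hypothetical schema).
        CREATE TABLE sales_fact (customer_id INTEGER, revenue REAL);
        -- Lookup table: one row per customer (hypothetical schema).
        CREATE TABLE customer_lookup (customer_id INTEGER PRIMARY KEY, segment TEXT);

        INSERT INTO sales_fact VALUES (1, 100.0), (1, 50.0), (2, 30.0), (3, 20.0);
        INSERT INTO customer_lookup VALUES (1, 'enterprise'), (2, 'smb'), (3, 'smb');

        -- Derived dataset: aggregated once, then reused by reports and models.
        CREATE TABLE revenue_by_segment AS
        SELECT l.segment AS segment,
               SUM(f.revenue) AS total_revenue,
               COUNT(*) AS order_count
        FROM sales_fact f
        JOIN customer_lookup l ON f.customer_id = l.customer_id
        GROUP BY l.segment;
        """
    )
    return cur.execute(
        "SELECT segment, total_revenue, order_count "
        "FROM revenue_by_segment ORDER BY segment"
    ).fetchall()


conn = sqlite3.connect(":memory:")
rows = build_derived_dataset(conn)
print(rows)
```

Downstream consumers would then query revenue_by_segment directly instead of repeating the join and aggregation against the raw tables.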

Key Benefits of Data Distiller

  1. Pre-Processed and Ready for Use: Derived datasets have undergone thorough processing, such as data normalization and feature engineering, ensuring that the dataset is clean and analysis-ready. This minimizes the time spent preparing data, allowing teams to focus on extracting insights and making strategic decisions.
  2. Consistency and Accuracy: With derived datasets, all users work with the same set of pre-calculated metrics and features, promoting consistency across different reports and analyses. This helps eliminate discrepancies that can occur when multiple teams manually process raw data.
  3. Enhanced Performance: Pre-processed data ensures faster query execution and reduced processing time for AI/ML models and dashboards, especially when dealing with large datasets.
  4. Cleaner and More Relevant Data: Derived datasets focus on key features and metrics, removing irrelevant information or noise, resulting in cleaner data that aligns with business goals.
  5. Improved Model Training: AI/ML models benefit from high-quality features included in the derived datasets, leading to better prediction accuracy and faster training times.

Sometimes, exporting datasets in a custom batch audience format may be necessary. Data Distiller supports these special export needs by acting as a contract between AEP and external destinations, ensuring that datasets are structured according to the requirements of the target platform.

Prototyping with Data Landing Zone Destination

In the tutorial below, the Data Landing Zone Destination was used as the central component for prototyping the export of datasets. It served as a staging area where data could be verified before final export, allowing us to quickly check and confirm the content of exported datasets. This streamlined the process of prototyping by providing a reliable method for external systems to access and validate the data.

Azure Storage Explorer was crucial for understanding the exported data, as it enabled us to browse, view, and download the exported files.

Understanding Data Usage Labeling and Enforcement

In the tutorial, Data Distiller allowed for the application of contract labels such as the C2 label to individual fields within a dataset. The C2 contract label marks fields that cannot be exported to third-party destinations. This mechanism is particularly useful when dealing with datasets that contain sensitive or regulated information. Data Usage Labeling and Enforcement (DULE) provided the framework to enforce these policies, preventing unauthorized or non-compliant exports, thereby maintaining data integrity and meeting privacy obligations.
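The rule described above can be sketched conceptually as a check that runs before an export. This is only an illustration of the idea, not the platform's implementation or API: the function name, the label mapping, and the destination flag are all hypothetical.

```python
# Labels that, in this sketch, forbid export to a third-party destination.
# "C2" comes from the article; treating it as a simple set lookup is an assumption.
THIRD_PARTY_BLOCKING_LABELS = {"C2"}


def find_violations(field_labels, destination_is_third_party):
    """Return the sorted names of fields whose labels forbid this export.

    field_labels maps a field name to the set of usage labels applied to it.
    An empty result means the export would be allowed under this simplified rule.
    """
    if not destination_is_third_party:
        return []
    return sorted(
        name
        for name, labels in field_labels.items()
        if labels & THIRD_PARTY_BLOCKING_LABELS
    )


# Hypothetical dataset: only the email field carries the C2 label.
labels = {"email": {"C2"}, "segment": set(), "total_revenue": set()}

violations = find_violations(labels, destination_is_third_party=True)
print(violations)
```

In this simplified model, an export to a third-party destination would be rejected whenever the returned list is non-empty.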

Try the Tutorial

The link is here

1 Comment

Level 4

2/21/25

Hi @SaurabhM 

Great tutorial. The whole book on Data Distiller is very informative and well written, thanks for it!

One question, though.

I tried exporting one of the out-of-the-box datasets: AJO Message Feedback Event Dataset - not a derived dataset yet. I am on a test sandbox, and not a lot of emails were sent out:

select count(*) from ajo_message_feedback_event_dataset -- 181

So very small dataset.

The initial (full) export was split into 17 very small files, 1 to 5 KB each. Is there a setting that can reduce the number of files, for example by making each file bigger (say, up to 1 MB) so that fewer files are created?

I presume that, under real production load, there will be hundreds of small files daily for a single dataset, which sounds like overkill.

Any suggestions would be greatly appreciated.