
Purge lookup dataset before new data load via Data Distiller


8/27/24

For audience composition, enrichment with a lookup dataset is a useful feature. However, populating the lookup dataset requires either an external data load or a scheduled query in Data Distiller.
Quite often this data is of a "refresh" nature, i.e. the best way to handle it is to purge the dataset and reload the fresh data, since record links may also have been deleted in the new load.

Dropping the dataset and creating a new one with CTAS is not an option, since the reference link for enrichment in Audience Composition would then need to be restored.

 

Why is this feature important to you 

- we have multiple use cases where we have to enrich a certain segment with extra data for personalization in AJO.
- we also have use cases where the rule-based segment builder is insufficient to design the query: a dedicated, more complex query (via Query Service) is needed to prepare extra data that can then be used to filter on. Adding this extra data to the profile could end up bloating the profile (if several of these scenarios are needed).

 

How would you like the feature to work

- the ability to run a CTAS "replace" query, where the previously inserted batch is deleted before the new batch is inserted

 

Current Behaviour

- batch deletion is only possible via API and needs an external process to manage it. We would like an option that is part of Data Distiller and does not require an external process.
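
To illustrate what such an external process has to do today, here is a minimal Python sketch that only builds the requests involved; it does not send them. The base URL, the `dataSet` filter on the Catalog batches endpoint, and the `action=REVERT` call on the batch ingestion endpoint reflect my understanding of the public API and should be checked against the current API reference. The class and function names are hypothetical, and authentication headers are left out.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Assumed base URL and paths; verify against the current API reference.
BASE_URL = "https://platform.adobe.io/data/foundation"


@dataclass(frozen=True)
class PlannedRequest:
    """One HTTP call the external process would have to make."""
    method: str
    url: str


def list_batches_request(dataset_id: str) -> PlannedRequest:
    """Build the call that lists the batches of one dataset (assumed filter name)."""
    query = urlencode({"dataSet": dataset_id})
    return PlannedRequest("GET", f"{BASE_URL}/catalog/batches?{query}")


def revert_requests(batch_ids: list[str]) -> list[PlannedRequest]:
    """Build one delete (revert) call per previously loaded batch."""
    return [
        PlannedRequest("POST", f"{BASE_URL}/import/batches/{batch_id}?action=REVERT")
        for batch_id in batch_ids
    ]


# Hypothetical IDs, for illustration only.
plan = [list_batches_request("my-lookup-dataset-id")]
plan += revert_requests(["batch-a", "batch-b"])
for request in plan:
    print(request.method, request.url)
```

Only after these calls have completed can the scheduled query insert the fresh batch, which is why the purge currently has to be orchestrated outside Data Distiller.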

1 Comment


Employee

9/4/24

First of all, this is a great question, and it deserves an explanation of what is happening and why.

 

Here’s an additional explanation for others to better understand the core issue:

  1. Each time a dataset is created, a new dataset ID is generated.

  2. If a dataset is dropped (deleted) and recreated, it will result in a new dataset ID.

The challenge arises when an AEP application or feature binds itself to a specific dataset ID. In such cases, there’s no way around this issue. For example, this occurs in audience composition or in the former Data Science Workspace, where the dataset ID was bound during training.

This makes it difficult to maintain continuity when dataset IDs are linked to specific processes or jobs.
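
The difference between purging a dataset in place and dropping and recreating it can be shown with a toy model. This is purely an illustration in Python, not how the platform catalog is implemented; `ToyCatalog`, its methods, and the `ds-N` ID format are all hypothetical.

```python
import itertools


class ToyCatalog:
    """Toy stand-in for a dataset catalog; every create() yields a fresh ID."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.datasets = {}  # dataset ID -> {"name": str, "rows": list}

    def create(self, name, rows=()):
        dataset_id = f"ds-{next(self._ids)}"
        self.datasets[dataset_id] = {"name": name, "rows": list(rows)}
        return dataset_id

    def drop(self, dataset_id):
        del self.datasets[dataset_id]

    def purge(self, dataset_id):
        # Empties the dataset but keeps its ID.
        self.datasets[dataset_id]["rows"].clear()

    def load(self, dataset_id, rows):
        self.datasets[dataset_id]["rows"].extend(rows)


catalog = ToyCatalog()
lookup_id = catalog.create("lookup", ["old row"])
bound_id = lookup_id  # the ID a downstream feature stored when it was configured

# Purge and reload: the stored ID still resolves.
catalog.purge(lookup_id)
catalog.load(lookup_id, ["new row"])
binding_ok_after_purge = bound_id in catalog.datasets

# Drop and recreate under the same name: a new ID is issued, the stored one is dead.
catalog.drop(lookup_id)
new_id = catalog.create("lookup", ["new row"])
binding_ok_after_recreate = bound_id in catalog.datasets

print(binding_ok_after_purge, binding_ok_after_recreate, new_id)
```

In this model the dataset name never changes, yet the binding survives only the purge, because the downstream feature holds on to the ID rather than the name.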

 

As of today, there isn't a workaround for this, but there are two options to consider:

  1. You can replicate (and even improve) the audience composition flow in Data Distiller audiences, a feature we are launching by the end of September 2024. Since dataset IDs are internal to the scheduled job, this issue will be addressed. The flow requires SQL, but the audiences you create will have the same net effect in Real-Time Customer Profile, as the flow uses the same backend.

  2. Data Distiller will also be introducing record deletes and updates. As part of this, we plan to include a TRUNCATE operation, which will empty the dataset but retain the dataset ID. However, I can’t provide a specific timeline, as we need to sort out the data foundation first. We expect to make some feature announcements in 2025.