
Historical data ingestion (event class) for newly added fields


When a new field is added to an event schema after the schema has already been enabled for Profile and data has been ingested, how do we ingest the historical data for the newly introduced field?

 

Some of the solutions I could think of are:

 

Option 1) One approach I am aware of is to create a new dataset containing the primary key and the new field, and then use that dataset to ingest both historical and incremental data. However, this approach may not scale: if we introduce a new field every quarter, we would end up with multiple datasets.


Option 2) Wipe the historical data from the dataset by deleting the batches that ingested it, and re-ingest the data with all attributes. This is time-consuming and can disrupt the existing production setup.

Option 3) Use Data Distiller to backfill the historical data. Is that technically possible, given that the event-based schema class doesn't support upsert?

 

Is there a recommended approach?

1 Reply


The best option is as follows:

1. Add the additional attribute to the schema.

2. Create a new dataset from the same schema and enable it for Profile.

3. Take a backup of the old dataset in a dataset that is not enabled for Profile.

4. Ingest the historical data into the new dataset, and schedule the incremental data into it as well, including values for the attribute you have just added. When ingesting the historical data, make sure each record uses a different _id from the one it had in the old dataset.

5. Drop the old dataset. Only dropping the dataset removes its data from the Profile store.
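
To make step 4 more concrete, here is a minimal Python sketch of how the historical records could be prepared before re-ingestion. It is only an illustration, not an Adobe API: the function name `rekey_for_backfill`, the namespace UUID, the field name `loyaltyTier`, and the flat record layout are all assumptions (in a real XDM payload the new field would normally sit under your tenant namespace). It gives each historical event a new, deterministic `_id` and fills in the new attribute where a value is known.

```python
import uuid
from typing import Any, Dict, Iterable, List

# Hypothetical namespace, used only to derive new _id values deterministically
# so that re-running the backfill produces the same ids every time.
BACKFILL_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def rekey_for_backfill(
    events: Iterable[Dict[str, Any]],
    new_field: str,
    values_by_old_id: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Return copies of historical events with a new _id and the new field set.

    The input events are not modified. Events without a known value for
    the new field are copied through without that field.
    """
    rekeyed_events = []
    for event in events:
        old_id = event["_id"]
        rekeyed = dict(event)  # shallow copy, leaves the original untouched
        # Derive a different _id from the old one (step 4 in the list above).
        rekeyed["_id"] = str(uuid.uuid5(BACKFILL_NAMESPACE, old_id))
        if old_id in values_by_old_id:
            rekeyed[new_field] = values_by_old_id[old_id]
        rekeyed_events.append(rekeyed)
    return rekeyed_events


if __name__ == "__main__":
    history = [
        {"_id": "evt-1", "timestamp": "2024-01-01T00:00:00Z"},
        {"_id": "evt-2", "timestamp": "2024-01-02T00:00:00Z"},
    ]
    for row in rekey_for_backfill(history, "loyaltyTier", {"evt-1": "gold"}):
        print(row)
```

Deriving the new `_id` from the old one (rather than using a random value) is a design choice in this sketch: it keeps the backfill repeatable if the job has to be rerun. The output would then be ingested into the new dataset using whatever ingestion method you already use.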

 

Option 1 in your question will be challenging for the reasons you mentioned: you end up with multiple datasets, have to process incremental data into each of them, and so on.

 

Option 2 won't work, because deleting the batches only removes the data from the data lake, not from the Profile store.

 

Option 3 is also not possible.