Re-ingestion of Missing Data Without Causing Duplicates
Hi everyone,
Issue:
Due to certain issues, data from some source files (corresponding to specific dates) is missing in AEP. We need to ingest this missing data, but we are concerned about duplicating records that are already in the dataset.
Some information on the dataflow setup:
- Our source data files are stored in Azure Blob Storage. These are incremental files, with a new file received daily. Each file is retained for 7 days before deletion.
- In AEP, we are using the Data Landing Zone (DLZ) and connecting to the Azure source via API.
- A dataflow has been created to handle incremental data loading from Azure to AEP DLZ via API, and it is currently running on a daily schedule.
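One constraint worth spelling out is the 7-day retention, since it limits which files can still be re-ingested. Below is a rough Python sketch of how we think about it. The function name and the exact cutoff rule (a file is available on its arrival date and the following 6 days) are our own assumptions for illustration, not something taken from Azure or AEP.

```python
from datetime import date

RETENTION_DAYS = 7  # retention period described in the setup above

def is_still_available(file_date: date, today: date) -> bool:
    """Hypothetical check: True if a file that arrived on file_date
    has not yet been deleted on `today`.

    Assumption: the file exists on its arrival date and for the
    following 6 days (7 days in total), then it is deleted.
    """
    age_in_days = (today - file_date).days
    return 0 <= age_in_days < RETENTION_DAYS

# Example with a hypothetical year: a file from May 2nd is still
# there on May 8th, but gone on May 9th under this assumption.
may_2 = date(2024, 5, 2)
still_there = is_still_available(may_2, date(2024, 5, 8))
already_deleted = not is_still_available(may_2, date(2024, 5, 9))
```

If the actual deletion rule differs by a day, only the comparison in `is_still_available` would need to change.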
Current Setup (Example):
- On May 1st, we performed a one-time data load into the dataset via API. After ingestion, we disabled this dataflow.
- On May 5th, we created and activated an incremental dataflow for the same dataset. This flow has been running daily and continues to function without issues.
- However, data from May 2nd to May 4th is missing in AEP.
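To make the gap explicit, here is a small Python sketch that derives the missing file dates from the two dates in the example. The year is hypothetical, and the variable names are ours.

```python
from datetime import date, timedelta

# Dates from the example above (the year is a placeholder).
one_time_load = date(2024, 5, 1)      # one-time load, flow disabled afterwards
incremental_start = date(2024, 5, 5)  # first run of the incremental dataflow

# Every day strictly between the two dates was never ingested.
gap_days = (incremental_start - one_time_load).days
missing_dates = [one_time_load + timedelta(days=i) for i in range(1, gap_days)]

for d in missing_dates:
    print(d.isoformat())
```

This lists May 2nd, May 3rd, and May 4th, which matches the gap we see in the dataset.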
We’ve been advised to re-ingest data from May 2nd to the current date to ensure data consistency.
(Example: A customer’s phone number might have changed between May 2nd and today.)
If we re-ingest data from May 2nd onwards, will this overlap with already ingested data (from May 5th onwards) and cause duplicates in the dataset?
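To clarify what we mean by "duplicates", here is a plain Python illustration with made-up records. It does not describe how AEP actually behaves; the field names (`customer_id`, `phone`, `file_date`) and values are hypothetical. It only contrasts a pure append with a "latest record per key" outcome, which is what we are hoping for.

```python
# Hypothetical record already in the dataset (from the May 5th incremental file).
already_ingested = [
    {"customer_id": "C1", "phone": "555-0100", "file_date": "2024-05-05"},
]

# Hypothetical re-ingestion from May 2nd onwards: one record from the
# missing window, plus the May 5th record that was already loaded.
reingested = [
    {"customer_id": "C1", "phone": "555-0199", "file_date": "2024-05-03"},
    {"customer_id": "C1", "phone": "555-0100", "file_date": "2024-05-05"},
]

# Outcome we are worried about: a pure append stores the May 5th record twice.
appended = already_ingested + reingested

# Outcome we are hoping for: one row per customer, taken from the newest file.
latest_by_customer = {}
for record in sorted(appended, key=lambda r: r["file_date"]):
    latest_by_customer[record["customer_id"]] = record

print(len(appended))            # number of rows after a pure append
print(latest_by_customer["C1"])  # the row we would want to keep
```

In this example the append leaves three rows for one customer, while the keyed version keeps only the May 5th phone number.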
We want to ensure the dataset remains accurate, up to date, and free of duplicates.
Any guidance on how to safely manage this re-ingestion process would be greatly appreciated.
Thanks,