Hi,
I am ingesting data from an Amazon S3 bucket every 15 minutes, and I noticed that data which has already been ingested is ingested again on every run, each time as a separate row in the dataset.
Is there a way to make sure that data already ingested is not ingested again?
Hi @arpan-garg, quick questions:
Here's an important thing to look for - https://experienceleague.adobe.com/docs/experience-platform/sources/ui-tutorials/dataflow/cloud-stor...
Thanks,
Chetanya
Hi @ChetanyaJain - Yes, I selected a particular file while ingesting data.
The existing file gets updated with the new values, but it also still contains the old values.
One thing to note is that when ingesting data from an S3 bucket, there is no option to select an incremental load field, so all the data is backfilled again.
Hi @arpan-garg, that explains the duplication.
Because the file is being modified, the source loads the whole file again and ingests every record in it; you cannot filter on a field. See the note below:
The best way to handle this is as follows:
Thank you, @ChetanyaJain, for the information. I will definitely give it a try. From what you describe, data cleaning has to be handled separately, and each new file should contain only new or updated records. If a file with a new timestamp also contains the old data, AEP will ingest everything again.
That's right: any record that arrives in a file will be ingested again. That is why you sometimes have to plan event ingestion very carefully. Once events are added, they cannot be modified.
So data preparation or preprocessing is necessary before ingesting the data? Dear Chetanya, is there a recommended approach to automate this? Are there any tools that could compute the diff and create the necessary input?
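One possible way to automate the preprocessing discussed above is to fingerprint every row that has already been sent and write a delta file containing only rows that have not been seen. The sketch below is a minimal illustration in plain Python, not an AEP or S3 feature: the function names `row_fingerprint` and `build_delta` are hypothetical, the input is assumed to be CSV with a header row, and storing the set of fingerprints between runs (and uploading the delta as a new file rather than overwriting the old one) is left to whatever scheduler or storage you already use.

```python
import csv
import hashlib
import io


def row_fingerprint(row: dict) -> str:
    """Hash every column of a row so that any changed value gives a new fingerprint."""
    canonical = "\x1f".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_delta(current_csv: str, seen: set) -> tuple:
    """Return (delta_csv, updated_seen) where delta_csv holds only unseen rows.

    `seen` is the set of fingerprints from earlier runs (hypothetical state
    that the caller must persist somewhere between runs).
    """
    reader = csv.DictReader(io.StringIO(current_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    updated = set(seen)
    for row in reader:
        fingerprint = row_fingerprint(row)
        if fingerprint not in updated:
            writer.writerow(row)  # new or changed row: include it in the delta
            updated.add(fingerprint)
    return out.getvalue(), updated


# Example: the second export still contains rows 1 and 2 plus a new row 3.
first_export = "id,value\n1,a\n2,b\n"
second_export = "id,value\n1,a\n2,b\n3,c\n"

delta_1, seen = build_delta(first_export, set())
delta_2, seen = build_delta(second_export, seen)
print(delta_2)  # header line followed by only the row for id 3
```

Note that a row whose values changed gets a new fingerprint and is therefore sent again as a new record, which matches the point above that records cannot be modified once ingested.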