Hi everyone - We recently implemented batch data ingestion for one of our engagements using the OOTB SFTP connection/batch flow. The source file we receive is a full refresh from the source system, and the dataset we load it into is not upsert/Profile enabled. The problem we are facing is that every day the full file is ingested, it keeps appending the records to the dataset.
We are trying to find a way to purge the previous data load from the dataset before the next run.
We need an automated solution without any manual intervention.
Could you please suggest a possible solution to achieve that?
Hello @debswarup82,
As I understand, data is not stored in a relational database in AEP.
As described in the dataset overview documentation linked below, data ingested into a dataset is stored as batches. This works much like a file storage system such as Hadoop.
Document URL:
https://experienceleague.adobe.com/docs/experience-platform/catalog/datasets/overview.html?lang=en
Now, when you set up the SFTP connection, you basically configure the following objects:
1. Create an SFTP connection account.
2. Select the target dataset.
3. Create a dataflow.
4. Every day when data is imported on the schedule, a new batch (batch file) is generated, and it has a unique batch ID.
Now, your requirement is to purge all the data (all batch files) from the dataset before importing a new batch.
You can delete these batches from the UI, but you want to automate this.
To automate it, you need to delete these batches with the API, as described in the document below (a rough sketch of such a delete call is included after the links).
Document Url:
You can schedule this API call from a third-party application or develop a custom app to purge these batches automatically. I am not sure whether a Jupyter notebook with Python can be used to automate the deletion of these batches.
You can check the following documents.
Document Url:
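To make that concrete, here is a minimal sketch of what such a delete call could look like in Python. It assumes the Catalog Service endpoint DELETE /data/foundation/catalog/batches/{BATCH_ID} and the usual Platform API headers; the environment variable names and the batch ID are placeholders you would replace with values from your own authentication setup.

```python
import os
import requests

# Standard AEP API gateway host; adjust if your organization uses a different one.
PLATFORM_HOST = "https://platform.adobe.io"

def delete_batch(batch_id: str) -> None:
    """Delete a single batch from the data lake via the Catalog Service API."""
    headers = {
        "Authorization": f"Bearer {os.environ['AEP_ACCESS_TOKEN']}",  # placeholder env vars
        "x-api-key": os.environ["AEP_API_KEY"],
        "x-gw-ims-org-id": os.environ["AEP_ORG_ID"],
        "x-sandbox-name": os.environ.get("AEP_SANDBOX", "prod"),
    }
    resp = requests.delete(
        f"{PLATFORM_HOST}/data/foundation/catalog/batches/{batch_id}",
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Deleted batch {batch_id}")

if __name__ == "__main__":
    # Batch ID recorded from the previous ingestion run (placeholder value).
    delete_batch("previous-run-batch-id")
```

The exact delete semantics (hard delete vs. revert) differ between the Catalog and Batch Ingestion APIs, so check the documents linked above before relying on this in production.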
But I do not fully understand why you want to purge the data. If you only need the latest data, you can add a date filter (for example, today's date) to your query or to the segment definition, so you always read from the latest files. You can also create your own merge policy or use the default one (which is timestamp-based).
When data is stored in a file system, the usual approach is to partition by date and read only the latest partition; the old data is not purged.
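As an illustration of that filter-instead-of-purge idea, here is a hedged sketch of a date-filtered query submitted through the Query Service REST API. It assumes the endpoint POST /data/foundation/query/queries and a database name of the form "<sandbox>:all"; the table name, timestamp column, and environment variables are placeholders for your own schema and credentials.

```python
import os
import requests

PLATFORM_HOST = "https://platform.adobe.io"

# Placeholder table/column names: Query Service exposes each dataset as a table;
# adjust these to match your dataset's name and schema.
SQL = """
SELECT *
FROM my_full_refresh_dataset
WHERE timestamp >= current_date
"""

headers = {
    "Authorization": f"Bearer {os.environ['AEP_ACCESS_TOKEN']}",
    "x-api-key": os.environ["AEP_API_KEY"],
    "x-gw-ims-org-id": os.environ["AEP_ORG_ID"],
    "x-sandbox-name": os.environ.get("AEP_SANDBOX", "prod"),
    "Content-Type": "application/json",
}

payload = {
    "dbName": "prod:all",  # assumption: "<sandbox>:all" naming; adjust to your sandbox
    "sql": SQL,
    "name": "latest-full-refresh-only",
    "description": "Read only today's full-refresh records instead of purging old batches",
}

resp = requests.post(
    f"{PLATFORM_HOST}/data/foundation/query/queries",
    headers=headers,
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("id"), resp.json().get("state"))
```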
Thanks.
Parvesh.
Hi @Parvesh_Parmar - Thanks for your detailed response.
Yes, deleting batches using the API is an option. But having another third-party application host the service that calls this API might be an overhead. Additionally, since every batch gets a unique ID, it would be difficult to automate the process, because those batch IDs somehow need to be made available to that third-party service automatically.
The reason we want to delete the batches is that our source file is a full refresh, so the same set of records is ingested into the dataset every day and the dataset size keeps growing with every load.
Is there any other solution you can think of?
Thanks,
Swarup Deb
Hello @debswarup82 ,
I completely understand your point.
You are currently using the out-of-the-box connection to import the batches, but out of the box there is no option to schedule a purge of batches. This seems to be a product limitation that may be addressed in the future; if you really need it, you can raise it with the product or support team, as it is a valid point.
As far as I can see, the best option is to handle the batch import via the API as well (a rough sketch of this flow follows after the list):
1. Create a script in any language that calls the API.
2. Create a batch for your dataset; this returns a batch ID (e.g. batchId1), which you save locally.
3. Import the file under batchId1.
4. On the next day's run, create a new batch (batchId2).
5. Import the file under batchId2.
6. Delete the previous batch, batchId1 (the ID you saved locally).
I did not find an API that returns the list of batch IDs for a dataset, so you need to save the ID locally in order to purge it later.
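A minimal sketch of that create → upload → complete → purge-previous flow, assuming the Batch Ingestion endpoints (POST /data/foundation/import/batches, PUT .../files/{name}, and the COMPLETE action) plus the Catalog delete shown earlier in the thread; the dataset ID, file name, state-file location, and credential variables are all placeholders.

```python
import os
from pathlib import Path

import requests

PLATFORM_HOST = "https://platform.adobe.io"
DATASET_ID = "REPLACE_WITH_DATASET_ID"      # placeholder
LOCAL_FILE = "full_refresh.parquet"         # placeholder: today's full-refresh file
STATE_FILE = Path("last_batch_id.txt")      # where the previous run's batch ID is saved

def _headers(content_type: str = "application/json") -> dict:
    # Standard Platform API headers; values come from your own auth setup (placeholders).
    return {
        "Authorization": f"Bearer {os.environ['AEP_ACCESS_TOKEN']}",
        "x-api-key": os.environ["AEP_API_KEY"],
        "x-gw-ims-org-id": os.environ["AEP_ORG_ID"],
        "x-sandbox-name": os.environ.get("AEP_SANDBOX", "prod"),
        "Content-Type": content_type,
    }

def create_batch() -> str:
    # Steps 2/4: create a new batch for the dataset and return its ID.
    resp = requests.post(
        f"{PLATFORM_HOST}/data/foundation/import/batches",
        headers=_headers(),
        json={"datasetId": DATASET_ID, "inputFormat": {"format": "parquet"}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]

def upload_and_complete(batch_id: str) -> None:
    # Steps 3/5: upload the file into the batch, then mark the batch as complete.
    with open(LOCAL_FILE, "rb") as fh:
        resp = requests.put(
            f"{PLATFORM_HOST}/data/foundation/import/batches/{batch_id}"
            f"/datasets/{DATASET_ID}/files/{LOCAL_FILE}",
            headers=_headers("application/octet-stream"),
            data=fh,
            timeout=300,
        )
    resp.raise_for_status()
    resp = requests.post(
        f"{PLATFORM_HOST}/data/foundation/import/batches/{batch_id}?action=COMPLETE",
        headers=_headers(),
        timeout=30,
    )
    resp.raise_for_status()

def delete_batch(batch_id: str) -> None:
    # Step 6: purge the previous run's batch via the Catalog Service.
    resp = requests.delete(
        f"{PLATFORM_HOST}/data/foundation/catalog/batches/{batch_id}",
        headers=_headers(),
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    previous_id = STATE_FILE.read_text().strip() if STATE_FILE.exists() else None
    new_id = create_batch()
    upload_and_complete(new_id)
    if previous_id:
        delete_batch(previous_id)       # remove yesterday's full refresh
    STATE_FILE.write_text(new_id)       # remember today's batch ID for the next run
```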
I also tried to find out whether Data Science Workspace (the Jupyter Notebook lab) allows purging data from datasets with a Python script. It does not seem possible there either: you can read from and write to a dataset, but you cannot delete a batch.
Thanks.
Parvesh.
Hi @Parvesh_Parmar - Thanks for your response.
It looks like the API is the only solution as of now, but as I understand it that involves quite a bit of manual setup and a dependency on an external application hosting the service that carries out this batch deletion via the API. We need a cleaner, fully automated solution.
If you come across any other option, please do let me know.
Thanks,
Swarup Deb
Hi,
Even though there may not be an out-of-the-box way to automate this today, we have written an automated Python script, as suggested by @Parvesh_Parmar, running on a Unix server and scheduled with a cron job to perform a similar task (a sketch of the cron wiring is included below). Going forward, there may also be an option to manage the TTL of data within a dataset as part of the "Data Hygiene" functionality of AEP. It is in beta and also depends on licensing; here is a link to the GitHub repo for this feature: experience-platform.en/ttl.md at main · AdobeDocs/experience-platform.en · GitHub. Using this feature, we could specify a TTL for the data, and once the TTL expires the data would be removed from the platform. Just a thought.
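For the scheduling part, here is a hedged sketch of how such a cron-driven purge could be wired up; the crontab schedule, file paths, and state-file location are hypothetical, and the delete call is the same Catalog Service endpoint sketched earlier in the thread.

```python
# Hypothetical crontab entry, running the purge script daily at 02:00:
#   0 2 * * * /usr/bin/python3 /opt/aep/purge_previous_batch.py >> /var/log/aep_purge.log 2>&1

import os
from pathlib import Path

import requests

PLATFORM_HOST = "https://platform.adobe.io"
STATE_FILE = Path("/opt/aep/state/last_batch_id.txt")  # hypothetical file written by the ingest job

def purge_previous_batch(batch_id: str) -> None:
    # Same Catalog Service delete call as sketched earlier in the thread.
    resp = requests.delete(
        f"{PLATFORM_HOST}/data/foundation/catalog/batches/{batch_id}",
        headers={
            "Authorization": f"Bearer {os.environ['AEP_ACCESS_TOKEN']}",
            "x-api-key": os.environ["AEP_API_KEY"],
            "x-gw-ims-org-id": os.environ["AEP_ORG_ID"],
            "x-sandbox-name": os.environ.get("AEP_SANDBOX", "prod"),
        },
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    if STATE_FILE.exists():
        purge_previous_batch(STATE_FILE.read_text().strip())
    else:
        print("No previous batch ID recorded; nothing to purge.")
```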
Regards,
Sayantan