
SOLVED

Skipping a record based on a specific field condition during batch data load


Level 2

Hello Team,

 

Need your expert opinion on skipping records while loading data using batch processing. Consider this scenario: a CSV file has millions of demographic records containing name, email, address, etc. We need to load the CSV file into a specific dataset in AEP using the AWS S3 source connector. The requirement is that if the email address is "none@none.com", we should not load that record; otherwise the record can be loaded. There is no ETL tool in place that could filter out those records. In this situation, can we skip these faulty records?

 

Any recommendation on this will be highly appreciated.

 

Thanks,

Bitun  


2 Replies


Correct answer by
Employee Advisor

Hi @bitunsen-2022 

If you have specific conditions that would reject records from being loaded into AEP, why include those records in the CSV file in the first place? As you said, AEP doesn't have much capacity to execute rules during data load, apart from the validation rules in the data schema itself.

I understand the concern about data quality; it is a general topic for any data management solution. That said, the best practice is to address data quality as close to the source as possible. Addressing it at the destination (AEP or any other data management solution) is the last line of defence on data quality, but it's like the tree hiding the forest: it doesn't address the issue at its origin.

 

Whichever system is producing the files could run data quality checks at extraction time to ensure only valid records are extracted. It could also produce a reject file to start an investigation into the invalid records.
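
To make that concrete, here is a minimal sketch of such an extraction-time filter (the file names and the "email" column are hypothetical, and this is a plain Python illustration, not an AEP feature):

```python
import csv

# Hypothetical file and column names -- adjust to the actual extraction job.
SOURCE = "demographics_export.csv"
VALID_OUT = "demographics_clean.csv"     # file that gets dropped into S3 for AEP
REJECT_OUT = "demographics_rejects.csv"  # kept aside for investigating bad records

with open(SOURCE, newline="") as src, \
     open(VALID_OUT, "w", newline="") as ok, \
     open(REJECT_OUT, "w", newline="") as bad:
    reader = csv.DictReader(src)
    ok_writer = csv.DictWriter(ok, fieldnames=reader.fieldnames)
    bad_writer = csv.DictWriter(bad, fieldnames=reader.fieldnames)
    ok_writer.writeheader()
    bad_writer.writeheader()
    for row in reader:
        # The rule from the question: drop records carrying the placeholder email.
        if (row.get("email") or "").strip().lower() == "none@none.com":
            bad_writer.writerow(row)
        else:
            ok_writer.writerow(row)
```

It streams row by row, so it stays cheap even for millions of records, and the reject file gives you the audit trail on the invalid records.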

On the AEP side, I would say your only options are data schema validation at best or, at worst, addressing data quality via Query Service: load the data into a "buffer" dataset, run a series of queries to validate it, and then ingest the valid records into the final dataset. That would work technically, but it makes the solution very difficult to maintain, and for that reason it is not really advised.
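
Just to show the shape of that Query Service workaround (the dataset and column names below are made up, and the exact SQL should be checked against the Query Service documentation), each batch would need something like this run against the buffer dataset:

```python
# Hypothetical sketch of the "buffer dataset" pattern via Query Service.
# Dataset and column names are invented; validate the SQL against the
# Query Service documentation before relying on it.
FILTER_QUERY = """
INSERT INTO final_demographics_dataset
SELECT *
FROM   buffer_demographics_dataset
WHERE  lower(email) <> 'none@none.com'
"""

# This statement would have to be run (in the Query Service editor or as a
# scheduled query) after every batch lands in the buffer dataset -- which is
# exactly the extra moving part that makes the approach hard to maintain.
```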

 

Please address these quality controls at the source itself (at least in the system that produces the CSV; since you know the rules to apply, it is better to apply them at extraction time) or even further up the data food chain, i.e. at the point of data capture.

Thanks

Denis

 


Level 2

Hi Denis,

 

Thanks a lot for your detailed response. I completely agree with all your points regarding the ETL process and ingesting data from CSV files. I will work with the team that handles the ETL process and have the data cleansed before it is shared with us.

 

On a side note: as you know, AEP also provides connectors to pull data directly from various databases. In those cases it is hard to assume the data will be completely cleansed, as many organizations still lack proper data governance. If AEP could provide conditional record-skipping capabilities, at least for these direct data connectors, I think it would be a great benefit.

 

I'm open to hearing your thoughts on this.

 

Thanks again.