SOLVED

Skipping a record based on specific condition on a field during batch data load


Hello Team,

 

Need your expert opinion on skipping records while loading data using batch processing. Consider this scenario: a CSV file contains millions of demographic records (name, email, address, etc.). We need to load the CSV file into a specific dataset in AEP using the AWS S3 source connector. The requirement is that if the email address is "none@none.com", the record should not be loaded; otherwise it should be. There is no ETL tool available that could filter out those records. In this situation, can we skip these faulty records?
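For context, since no ETL tool is available, one workaround is a small pre-filter step run before the file is uploaded to S3, outside AEP entirely. A minimal sketch in Python, assuming the CSV has an `email` column header; `filter_records` and the paths are hypothetical names, not anything AEP provides:

```python
import csv

SKIP_EMAIL = "none@none.com"  # placeholder address whose records must be skipped

def filter_records(in_path, out_path):
    """Copy rows from in_path to out_path, skipping rows whose 'email'
    field equals SKIP_EMAIL. Returns (kept, skipped) counts."""
    kept = skipped = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row.get("email") == SKIP_EMAIL:
                skipped += 1
                continue  # drop the faulty record
            writer.writerow(row)
            kept += 1
    return kept, skipped
```

The filtered output file would then be the one placed in the S3 bucket for the source connector to pick up.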

 

Any recommendation on this will be highly appreciated.

 

Thanks,

Bitun  

1 Accepted Solution


Correct answer by
Employee Advisor

Hi @bitunsen-2022 

If you have specific conditions that would reject records from being loaded into AEP, why include them in the CSV file in the first place? As you said, AEP doesn't have much capacity to execute rules during data load, apart from the validation rules in the data schema itself.

I understand the concern about data quality; it is a general topic for any data management solution. That said, the best practice is to address data quality as close to the source as possible. Addressing it at the destination (AEP or any other data management solution) is the last rampart for data quality, but it's like the tree hiding the forest: it doesn't address the issue at its origin.

 

Whichever system is producing the files could apply data quality checks at extraction time to ensure only valid records are extracted, and produce a reject file to start investigating the invalid records.

On the AEP side, your only options are data schema validation at best or, at worst, addressing data quality via Query Service: load the data into a "buffer" dataset, run a series of queries to validate it, and then ingest the valid records into the final dataset. That would work technically, but it would make the solution very difficult to maintain, and for that reason it is not really advisable.
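To illustrate the buffer-then-validate pattern generically (this is plain Python as a sketch of the idea, not actual Query Service SQL; `promote` and `is_valid` are hypothetical names):

```python
BAD_EMAIL = "none@none.com"  # hypothetical rejection rule from the question

def is_valid(record):
    """Validation step: reject records carrying the placeholder email."""
    return record.get("email") != BAD_EMAIL

def promote(buffer_dataset):
    """Split a 'buffer' dataset into the final dataset and a reject list,
    so invalid rows can be investigated rather than silently dropped."""
    final, rejects = [], []
    for record in buffer_dataset:
        (final if is_valid(record) else rejects).append(record)
    return final, rejects
```

In the actual Query Service approach, the same split would be expressed as validation queries over the buffer dataset, which is exactly the extra moving part that makes the solution hard to maintain.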

 

Please address these quality controls at the source itself (at least in the system that produces the CSV: since you know the rules to apply, it is better to apply them on the extraction side) or further up the data food chain, i.e. at the original point of data capture.

Thanks

Denis

 


2 Replies



Hi Denis,

 

Thanks a lot for your detailed response. I completely agree with all your points about the ETL process and ingesting data from CSV files. I will work with the team that runs the ETL process and have the data cleansed before it is shared with us.

 

On a side note: as you know, AEP also provides connectors to pull data directly from various databases. In those cases it is hard to assume the data will be completely cleansed, since to date many organizations lack proper data governance. If AEP could provide conditional record-skipping capabilities, at least for these direct data connectors, I think it would be a great benefit.

 

Open to hearing your thoughts on this.

 

Thanks again.