
SOLVED

Best Practice - Adding new data points to an existing schema & datasets


Community Advisor

Team - I'm looking for the best practice you follow when adding new data points to an existing schema. Let's say we have an existing schema and dataset where data has been flowing into the data lake as expected. There are scenarios where we need to add a couple of data points to the schema, and eventually we also need to ingest those data points into the AEP data lake through a dataset. So I would like to know which approach you follow to ingest data points that have been newly added to the schema.

1. Create a new dataset and ingest the additional data points along with the primary identifier - This is the approach we currently follow and it works as expected.
2. Ingest the data points into the existing dataset with the primary identifier - The data got ingested successfully and I could see the profile count for the couple of data points under the segment. However, we couldn't see the data in Query Service, which makes us think there is technically some issue with ingesting into the existing dataset.

I would like to hear more from you - please share any suggestions.
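For context, here is a minimal sketch of what option 2 looks like on our side: a partial record carrying only the new data points plus the primary identifier, pushed into the existing dataset. The field names, the "_tenant" namespace, and the upload helper are illustrative placeholders, not our actual implementation.

```python
# Illustrative only: field names, "_tenant", and the helper are placeholders.
partial_record = {
    "_tenant": {
        "customerId": "CUST-0042",    # primary identity field
        "loyaltyTier": "gold",        # newly added data point
        "preferredChannel": "email",  # newly added data point
    }
}

def upload_batch_to_dataset(dataset_id, records):
    """Hypothetical stand-in for the batch ingestion flow
    (create batch -> upload JSON lines -> mark batch complete)."""
    print(f"would upload {len(records)} record(s) to dataset {dataset_id}")

# Same primary identity as the records ingested earlier, so Unified Profile
# can stitch this partial record onto the existing profile attributes.
upload_batch_to_dataset("EXISTING_DATASET_ID", [partial_record])
```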


4 Replies


Employee Advisor

Hi @jayakrishnaaparthasarathy 

 

To sum up: you added new fields to an existing schema, and you have tried two approaches for the dataset:

  1. Adding the fields to an existing dataset
  2. Creating a new dataset with just the added fields

Am I correct?

Remember that data in the AEP Data Lake is stored on an insertion basis, so depending on how you populate your dataset (in scenario 1), you would certainly not see ALL values in a given batch. Let's say that in scenario 1 you have two entry points that populate different parts of the dataset, all using the same primary identity: you would have two batches, each with partial information (and with different ingestion dates too), so in the dataset you end up with two "records" that together hold the full information. It is only on the Unified Profile that you will see all the information for a given primary identity.

 

This also means that in Query Service, in scenario 1, you would get back two records for a given primary identity, each with partial information.
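To make that concrete, here is a rough sketch (made-up field names, not your actual schema) of the two partial rows you would see in the lake for scenario 1, and the kind of consolidation you would have to do yourself in Query Service:

```python
# Two batches, same primary identity, each carrying only part of the record.
lake_rows = [
    {"customerId": "CUST-0042", "email": "a@b.com", "loyaltyTier": None},  # original fields batch
    {"customerId": "CUST-0042", "email": None, "loyaltyTier": "gold"},     # new fields batch
]

# In Query Service you would have to collapse the rows yourself, e.g. with
# something along these lines (illustrative SQL, table/column names assumed):
#   SELECT customerId,
#          MAX(email)       AS email,
#          MAX(loyaltyTier) AS loyaltyTier
#   FROM   existing_dataset
#   GROUP  BY customerId;

# The same consolidation expressed in plain Python for illustration:
merged = {}
for row in lake_rows:
    profile = merged.setdefault(row["customerId"], {})
    for field, value in row.items():
        if value is not None:
            profile[field] = value

print(merged)
# {'CUST-0042': {'customerId': 'CUST-0042', 'email': 'a@b.com', 'loyaltyTier': 'gold'}}
```

Unified Profile does that stitching for you based on the identity; the raw lake tables that Query Service reads do not.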

 

Does that fit what you observe?

Thanks

Denis

 


Community Advisor

@Denis_Bozonnet  

Thank you for your reply. Yes, you are right. I have tried both approaches for adding the new data points to the AEP data lake. I am observing the same scenario you describe for approach 2. However, with approach 2 the number of datasets associated with the specific schema keeps growing every time we add new data points per the business need.
What we also observed with approach 1 is that we couldn't see the data point values in Query Service, but, as mentioned, the profile count does show up under the segment and we can also see the values in the dataset PREVIEW.

Therefore, we have decided to go with approach 2, as we suspect approach 1 may technically have some issue. What would be a good suggestion for adding the new fields? Just wondering.

@Denis_Bozonnet - Looking forward to hearing back from you, so we can finalise how to add the data points accordingly. Thanks much!


Correct answer by
Employee Advisor

Hi @jayakrishnaaparthasarathy 

I would say that with Option 1, you would also modify the existing collection to pass all fields at once (new and old) going forward, meaning that when you get data into the dataset you would have the whole record. If, via option 1, you have two data collections, one for the old fields and one for the new fields, then what you observed is correct: there would be two records in the lake representing the whole set of information, meaning that in Query Service you would have to merge the records to get a single view of the information. For that reason, option 1 is good if you can have a single data collection for the whole dataset: old + new information at once.
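A tiny sketch of that difference, with placeholder field names:

```python
# Option 1 done with a single collection: every batch carries the whole
# record (old + new fields), so each lake row is already complete.
full_record = {
    "customerId": "CUST-0042",
    "email": "a@b.com",      # existing field
    "loyaltyTier": "gold",   # newly added field
}

# Two separate collections (one for the old fields, one for the new ones)
# instead land two partial rows per identity in the same dataset, which
# Query Service will not merge for you:
partial_old = {"customerId": "CUST-0042", "email": "a@b.com"}
partial_new = {"customerId": "CUST-0042", "loyaltyTier": "gold"}
```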

 

Option 2 would work if you have distinct data collections set up for the old and new information, which is probably your use case.

 

Unified Profile will merge the underlying datasets based on the merge rules (the default being time-based) to provide the single view.

 

If my understanding of your deployment use case is correct, option 2 is probably the most viable.

Hope this helps

Denis