
SOLVED

Does duplicate data in a dataset affect Customer AI model predictions?


Level 3

I am using an event-type dataset for a Customer AI instance. It has a daily data ingest frequency and hence contains duplicate rows of data carrying the latest timestamp. Will the duplicate records affect my model predictions?

 

(Judging from the influential factors, I do think the results might be influenced by the duplicate records.)


1 Accepted Solution


Correct answer by
Community Advisor

Hello @SahuSa1 

 

AFAIK, it will affect your model predictions because the data shows multiple records for the same event.
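
To illustrate the point, here is a minimal Python sketch. It assumes (this is not confirmed anywhere in the thread) that the model derives features such as per-user event counts. The field names `user_id`, `event_id`, and `timestamp` and the sample rows are hypothetical. Under that assumption, re-ingesting the same event with a new timestamp inflates the count:

```python
from collections import Counter

# Hypothetical rows: the same event ("e1") was re-ingested on three
# consecutive days, each time with a fresh timestamp.
events = [
    {"user_id": "u1", "event_id": "e1", "timestamp": "2024-01-01"},
    {"user_id": "u1", "event_id": "e1", "timestamp": "2024-01-02"},
    {"user_id": "u1", "event_id": "e1", "timestamp": "2024-01-03"},
    {"user_id": "u2", "event_id": "e2", "timestamp": "2024-01-03"},
]

# Count-style feature computed on the raw rows, duplicates included.
raw_counts = Counter(row["user_id"] for row in events)

# Same feature computed on distinct (user_id, event_id) pairs.
distinct_pairs = {(row["user_id"], row["event_id"]) for row in events}
true_counts = Counter(user_id for user_id, _ in distinct_pairs)

print(raw_counts["u1"], true_counts["u1"])  # 3 1
```

User `u1` performed one event but appears to have performed three, so any frequency-based signal for that user would be overstated.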

 

 


     Manoj


6 Replies


Administrator

@Travis_Jordan @Parvesh_Parmar @DavidRoss91 @vishnuunnikrishnan Could you please look at this question and share your thoughts?



Kautuk Sahni


Administrator

@_Manoj_Kumar_ @nnakirikanti @dhanesh04s @an1989 @renatoz28 @brekrut @ccg1706 @somen-sarkar @saswataghosh @NickMannion Kindly take a moment to review this question and share your valuable insights. Your expertise would be greatly appreciated!



Kautuk Sahni


Level 3

Hi @_Manoj_Kumar_ , thanks for your reply.

 

I also confirmed with Adobe support: they said duplicate data will indeed affect model predictions. Currently there is no de-dupe logic, as use cases might vary for different businesses.
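
Since there is no built-in de-dupe logic, one option is to de-duplicate before ingestion. Below is a rough Python sketch, not an Adobe API: `dedupe_events` and the field names are hypothetical, and it assumes a stable business key (here `user_id` plus `event_id`) identifies an event independently of the ingestion timestamp. It keeps the row with the latest timestamp for each key.

```python
def dedupe_events(rows, key_fields=("user_id", "event_id"), ts_field="timestamp"):
    """Keep one row per business key, preferring the latest timestamp.

    Assumes timestamps are ISO 8601 strings in one consistent format,
    so that string comparison matches chronological order.
    """
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        # Replace the stored row only if this one is strictly newer.
        if key not in latest or row[ts_field] > latest[key][ts_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample: event "e1" was re-ingested on three days.
rows = [
    {"user_id": "u1", "event_id": "e1", "timestamp": "2024-01-01"},
    {"user_id": "u1", "event_id": "e1", "timestamp": "2024-01-02"},
    {"user_id": "u1", "event_id": "e1", "timestamp": "2024-01-03"},
    {"user_id": "u2", "event_id": "e2", "timestamp": "2024-01-03"},
]

deduped = dedupe_events(rows)
print(len(deduped))  # 2
```

Which fields make up the key depends on the business, which is presumably why no generic logic is provided. Whether keeping only the latest-dated row interacts correctly with a TTL would need to be checked separately.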

 

Thanks


Employee

@SahuSa1 

Can I ask why the source is producing duplicate rows of data with the latest timestamp?

 

If the data is not changing why is the data being re-ingested?


Level 3

Hi @brekrut , thanks for your reply.

 

The source is set to update its incremental date field to the latest date every day, as per the client's requirement. The client did not want to lose old data or users who have been inactive for a few months (as a TTL is applied).

 

New data gets generated every day, and the historical data plus the new data is ingested every day.

 

Thanks