Does duplicate data present in dataset affect customer AI model predictions? | Community
SahuSa1
Level 3
November 19, 2024
Solved

Does duplicate data present in dataset affect customer AI model predictions?

  • November 19, 2024
  • 4 replies
  • 951 views

I am using an event-type dataset for a Customer AI instance. It has a daily data ingestion frequency and therefore contains duplicate rows of data, each carrying the latest timestamp. Will the duplicate records affect my model predictions?

 

(Judging from the influential factors, I suspect the results might be influenced by duplicate records.)

This post is no longer active and is closed to new replies. Need help? Start a new post to ask your question.
Best answer by Manoj_Kumar

Hello @sahusa1 

 

AFAIK, it will affect your model predictions, because the data shows multiple records for the same event.

 

 

4 replies

kautuk_sahni
Community Manager
November 22, 2024

@travis_jordan @parvesh_parmar @davidross91 @vishnuun Could you please take a look at this question and share your thoughts?

Kautuk Sahni
kautuk_sahni
Community Manager
December 12, 2024

@_manoj_kumar_ @nnakirikanti @dhaneshsh2 @avinashwiley @renatoz28 @brekrut @ccg1706 @somen-sarkar @saswataghosh @nickmannion Kindly take a moment to review this question and share your valuable insights. Your expertise would be greatly appreciated!

Kautuk Sahni
Manoj_Kumar
Community Advisor
Accepted solution
December 17, 2024

Hello @sahusa1 

 

AFAIK, it will affect your model predictions, because the data shows multiple records for the same event.
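To illustrate why the repeated rows matter: if the model derives per-profile features such as event counts, every duplicate inflates the count. A minimal sketch (the field names `event_id`, `user`, `type`, and `ts` are hypothetical, not from the actual schema):

```python
from collections import Counter

# Hypothetical event rows: the same event re-ingested daily shows up
# once per ingest run, each copy carrying a newer timestamp.
events = [
    {"event_id": "e1", "user": "u1", "type": "purchase", "ts": "2024-11-18"},
    {"event_id": "e1", "user": "u1", "type": "purchase", "ts": "2024-11-19"},  # duplicate
    {"event_id": "e2", "user": "u1", "type": "pageview", "ts": "2024-11-19"},
]

# Naive per-user event counts (a typical propensity feature)
# double-count u1's single purchase.
raw_counts = Counter((e["user"], e["type"]) for e in events)
print(raw_counts[("u1", "purchase")])  # 2 instead of 1

# Counting distinct event IDs removes the inflation.
dedup_counts = Counter()
seen = set()
for e in events:
    if e["event_id"] not in seen:
        seen.add(e["event_id"])
        dedup_counts[(e["user"], e["type"])] += 1
print(dedup_counts[("u1", "purchase")])  # 1
```

A model trained on the inflated counts would see u1 as twice as active a purchaser as the user really is, which is the kind of skew the influential factors can surface.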

 

 

Manoj     Find me on LinkedIn
SahuSa1 (Author)
Level 3
December 17, 2024

Hi @_manoj_kumar_ , thanks for your reply.

 

I also confirmed with Adobe support; they said duplicate data will indeed affect model predictions. Currently there is no de-dupe logic, as use cases may vary across different businesses.
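Since the platform itself does not de-duplicate, one option is to collapse duplicates before ingestion, keeping only the most recent copy of each event. A minimal sketch, assuming each row carries a unique `event_id` and an ISO-format `ts` (both hypothetical names for whatever fields the actual schema uses):

```python
def dedupe_latest(rows):
    """Keep one row per event_id, preferring the latest timestamp.
    ISO-8601 timestamps compare correctly as strings."""
    latest = {}
    for row in rows:
        key = row["event_id"]
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"event_id": "e1", "ts": "2024-12-16T00:00:00Z", "type": "purchase"},
    {"event_id": "e1", "ts": "2024-12-17T00:00:00Z", "type": "purchase"},  # re-ingested copy
    {"event_id": "e2", "ts": "2024-12-17T00:00:00Z", "type": "pageview"},
]
deduped = dedupe_latest(rows)
print(len(deduped))  # 2
```

This is a pre-ingestion cleanup sketch, not a platform feature; it only works if the source provides a stable unique event identifier to key on.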

 

Thanks

brekrut
Adobe Employee
December 17, 2024

@sahusa1 

Can I ask why the source is producing duplicate rows of data with the latest timestamp?

 

If the data is not changing why is the data being re-ingested?

SahuSa1 (Author)
Level 3
December 17, 2024

Hi @brekrut , thanks for your reply.

 

The source is set to update its incremental date field to the latest date every day, per the client's requirement. The client did not want to lose old data or users who are inactive for a few months (since a TTL is applied).

 

New data is generated every day, and historical data plus the new data is ingested every day.
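If the source must keep emitting the full history like this, the pipeline can still filter the daily feed down to genuinely new rows with a watermark (the latest original event time already ingested). A sketch with hypothetical field names, and one caveat: it only works if the original event time (`event_ts` here) is preserved alongside the incremental field that the source refreshes each day:

```python
def incremental_rows(rows, watermark):
    """Return only rows whose original event time is after the
    last-ingested watermark (ISO-8601 strings compare correctly)."""
    return [r for r in rows if r["event_ts"] > watermark]

# Daily feed = full history + today's new events; only the new ones pass.
feed = [
    {"event_id": "e1", "event_ts": "2024-12-16T08:00:00Z"},  # historical, already ingested
    {"event_id": "e3", "event_ts": "2024-12-17T09:30:00Z"},  # new today
]
new_rows = incremental_rows(feed, watermark="2024-12-16T23:59:59Z")
print([r["event_id"] for r in new_rows])  # ['e3']
```

After each run, the watermark would be advanced to the maximum `event_ts` seen, so re-ingested historical rows are skipped while old data stays intact at the source.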

 

Thanks