Solved

Does duplicate data present in dataset affect customer AI model predictions?


Level 3

I am using an event-type dataset for a Customer AI instance that has a daily data ingestion frequency, and it therefore contains duplicate rows of data with the latest timestamp. Will the duplicate records affect my model predictions?

 

(Judging from the influential factors, I do think the results might be affected by duplicate records.)


1 Accepted Solution


Correct answer by
Community Advisor

Hello @SahuSa1 

 

AFAIK, it will affect your model predictions, because the data shows multiple records for the same event.

 

 


     Manoj
     Find me on LinkedIn


6 Replies


Administrator

@Travis_Jordan @Parvesh_Parmar @DavidRoss91 @vishnuunnikrishnan Could you please take a look at this question and share your thoughts?



Kautuk Sahni


Administrator

@_Manoj_Kumar_ @nnakirikanti @dhanesh04s @an1989 @renatoz28 @brekrut @ccg1706 @somen-sarkar @saswataghosh @NickMannion Kindly take a moment to review this question and share your valuable insights. Your expertise would be greatly appreciated!



Kautuk Sahni



Level 3

Hi @_Manoj_Kumar_ , thanks for your reply.

 

I also confirmed with Adobe Support; they said duplicate data will indeed affect model predictions, and there is currently no de-dupe logic because use cases may vary across businesses.
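Since there is no built-in de-dupe logic, one option is to de-duplicate downstream before the data reaches the model. A minimal pandas sketch, assuming hypothetical column names `event_id` and `timestamp` (your schema will differ), that keeps only the latest record per event:

```python
# Hypothetical de-duplication sketch: keep only the most recent row
# per event id. Column names "event_id", "timestamp", and "value" are
# illustrative assumptions, not the actual dataset schema.
import pandas as pd

events = pd.DataFrame({
    "event_id": ["e1", "e1", "e2"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    "value": [10, 10, 5],
})

# Sort by timestamp, then keep the last (latest) occurrence of each event id.
deduped = (
    events.sort_values("timestamp")
          .drop_duplicates(subset="event_id", keep="last")
          .reset_index(drop=True)
)
print(len(deduped))  # 2
```

The same idea could be expressed in Query Service SQL with a `ROW_NUMBER()` window over the event id ordered by timestamp, keeping row 1.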

 

Thanks


Employee

@SahuSa1 

Can I ask why the source is producing duplicate rows of data with the latest timestamp?

 

If the data is not changing why is the data being re-ingested?


Level 3

Hi @brekrut , thanks for your reply.

 

The source is set to update its incremental date field to the latest date every day, as per the client's requirement. The client did not want to lose old data or users who have been inactive for a few months (since a TTL is applied).

 

New data is generated every day, and historical data plus new data is ingested every day.

 

Thanks