
SOLVED

Updates to Record dataset showing multiple entries in Queries console


Level 4

Hi,

 

I have a Record behavior dataset created as a lookup for CJA. The schema is simple, just the default "_id" field as the article ID, the article name and author name.

 

After the initial data upload, we found some mistakes in the author name field, so we had to upload certain records again, e.g.

 

Initial upload - (_id:123, name:book1, author:john)

Record update upload - (_id:123, name:book1, author:paul)

 

I was expecting the record to be updated based on _id:123. However, when I used Queries for debugging with the SQL "select * from <lookup_table_name> where _id = 123", both records showed up in the result window.

 

Can anyone share some insight on this?

 

Thanks,

John

 

1 Accepted Solution


Correct answer by
Level 3

I believe this is by design. You can think of the AEP data lake like any other data lake: data is always appended to the dataset, never updated in place.
This also means the dataset keeps a history of everything ingested and every change made to it.

 

Query Service works as a simple query layer on the data lake, so it shows all the data. CJA and RTCDP, however, pick the latest record in a record dataset based on the _id.
If you want the lookup dataset to hold only one copy of each record, the usual advice is to drop the dataset and ingest the data again.
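
To make the difference concrete, here is a toy sketch in plain Python. It is not how AEP is implemented; it only models the behavior described above, and it assumes that "latest" means the record ingested last. The function names `append_batch` and `latest_by_id` are made up for illustration.

```python
from typing import Dict, List

Record = Dict[str, object]


def append_batch(dataset: List[Record], batch: List[Record]) -> None:
    """Model an append-only store: new rows are added, nothing is overwritten."""
    dataset.extend(batch)


def latest_by_id(dataset: List[Record]) -> Dict[object, Record]:
    """Model a consumer that keeps only the last-ingested record per _id.

    Assumption: records appear in ingestion order, so later rows win.
    """
    resolved: Dict[object, Record] = {}
    for record in dataset:
        resolved[record["_id"]] = record
    return resolved


dataset: List[Record] = []
append_batch(dataset, [{"_id": 123, "name": "book1", "author": "john"}])  # initial upload
append_batch(dataset, [{"_id": 123, "name": "book1", "author": "paul"}])  # correction upload

# A plain query layer sees every appended row for the _id.
raw_matches = [r for r in dataset if r["_id"] == 123]

# A "latest wins" consumer sees a single, corrected row.
resolved = latest_by_id(dataset)

print(len(raw_matches))
print(resolved[123]["author"])
```

The raw view corresponds to what the Queries console returned in the question, and the resolved view corresponds to what shows up in CJA.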

 


3 Replies


Community Advisor

Hi @Hey_John - This is how AEP works. Setting _id as a primary identity does not mean the old record is overwritten when a new entry is added; the old record stays in the data lake. When you query by _id, you will see both records.

 

One solution could be to add a timestamp field as well and fetch the entry with the latest timestamp.
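
A minimal sketch of that idea, assuming the dataset carries a timestamp column. The column name `ingested_at`, the table name `lookup_table`, and the extra row for _id 456 are made up for illustration. The query uses the standard SQL `ROW_NUMBER()` window function and is run here against an in-memory SQLite database through Python's `sqlite3` module only so that the example is self-contained; whether it runs unchanged in Query Service is an assumption you should verify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lookup_table (_id INTEGER, name TEXT, author TEXT, ingested_at TEXT)"
)
conn.executemany(
    "INSERT INTO lookup_table VALUES (?, ?, ?, ?)",
    [
        (123, "book1", "john", "2024-07-01T00:00:00Z"),  # initial upload
        (123, "book1", "paul", "2024-07-05T00:00:00Z"),  # correction upload
        (456, "book2", "mary", "2024-07-01T00:00:00Z"),  # hypothetical extra row
    ],
)

# Rank the rows of each _id from newest to oldest, then keep only the newest.
LATEST_PER_ID = """
SELECT _id, name, author
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY _id ORDER BY ingested_at DESC) AS rn
    FROM lookup_table
) AS ranked
WHERE rn = 1
ORDER BY _id
"""

# The plain query returns every appended row for the _id.
all_rows = conn.execute(
    "SELECT _id, author FROM lookup_table WHERE _id = 123"
).fetchall()

# The deduplicating query returns one row per _id.
latest_rows = conn.execute(LATEST_PER_ID).fetchall()

print(len(all_rows))
print(latest_rows)
```

ISO 8601 strings in a single time zone sort correctly as text, which is why ordering by the `ingested_at` string works in this sketch.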

 

However, in the profile store, when you search for a profile using the identity namespace, it returns only the latest record based on the timestamp.


Level 4

Hi @arpan-garg 

 

Thanks for your reply.

 

Since this is a lookup schema/dataset, which is of the Record behavior/type, is it supposed to have a timestamp field like an Event-based schema/dataset?

 

From a functional perspective, there is no issue, as I can see the updated lookup value in CJA. But as far as I know, the AEP data lake charges by the total number of records in a dataset, so it is a bit odd that a dataset record cannot be updated.

 

Besides, Adobe calls the "_id" field "a unique identifier for the record"; I don't see why it's "unique" if multiple records with the same _id value can be inserted.

 

Hey_John_0-1720163466373.png

 

Is it really by design, or is it something Adobe will address in the future?

 
