Updates to Record dataset showing multiple entries in Queries console | Community
John_Man
Community Advisor
July 4, 2024
Solved


Hi,

 

I have a Record behavior dataset created as a lookup for CJA. The schema is simple, just the default "_id" field as the article ID, the article name and author name.

 

After the initial data upload, we found some mistakes in the author name field, so we had to upload certain records again, e.g.

 

Initial upload - (_id:123, name:book1, author:john)

Record update upload - (_id:123, name:book1, author:paul)

 

I was expecting the record to be updated based on _id:123. However, when I used Queries for debugging with the SQL "select * from <lookup_table_name> where _id = 123", both records showed up in the result window.

 

Can anyone share some insight on this?

 

Thanks,

John

 


1 reply

arpan-garg
Community Advisor
July 5, 2024

Hi @john_man - This is how AEP works. Setting _id as a primary identity does not mean the old record is overwritten when a new entry is added; the old record still stays in the data lake. When you query by _id, you will see both records.

 

One solution could be to also add a timestamp field and fetch the entry with the latest timestamp.

 

However, in the profile store, when you search for a profile using the identity namespace, it will only give you the latest record based on the timestamp.
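
The timestamp suggestion above can be sketched as a "latest row per _id" query. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for Query Service, and the table name `lookup_table`, the `updated_at` column, the timestamp values, and the second article (`_id` 456) are all made up for the example. A Record schema does not necessarily have such a field, so it would have to be added to the schema first.

```python
import sqlite3

# In-memory SQLite stands in for Query Service here (assumption).
# Table name, updated_at column, and sample values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lookup_table (_id INTEGER, name TEXT, author TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO lookup_table VALUES (?, ?, ?, ?)",
    [
        (123, "book1", "john", "2024-07-01T00:00:00Z"),  # initial upload
        (123, "book1", "paul", "2024-07-04T00:00:00Z"),  # corrected upload
        (456, "book2", "mary", "2024-07-01T00:00:00Z"),  # untouched record
    ],
)

# Rank the rows of each _id from newest to oldest, then keep only the newest.
LATEST_PER_ID = """
SELECT _id, name, author
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY _id ORDER BY updated_at DESC) AS rn
    FROM lookup_table
) ranked
WHERE rn = 1
ORDER BY _id
"""

latest_rows = conn.execute(LATEST_PER_ID).fetchall()
print(latest_rows)
```

The plain `select * ... where _id = 123` still returns both rows; only the ranked query collapses them to one row per `_id`. Whether the same window-function syntax runs unchanged in Query Service should be checked against its SQL reference.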

John_Man
Community Advisor, Author
July 5, 2024

Hi @arpan-garg 

 

Thanks for your reply.

 

Since this is a lookup schema/dataset, which falls under the Record behavior/type, is it supposed to have a timestamp field like an Event-based schema/dataset?

 

From a functional perspective, there is no issue, as I can see the updated lookup value in CJA. But as far as I know, the AEP data lake is charged by the total number of records in a dataset, so it is a bit odd if a dataset record cannot be updated.

 

Besides, Adobe calls the "_id" field "a unique identifier for the record", so I don't see why it is "unique" if multiple records with the same _id value can be inserted.

 

 

Is it really by design, or is it something Adobe will address in the future?

 

sreeCharan73
Accepted solution
Level 2
July 10, 2024

I guess that is by design. You can think of the AEP data lake as similar to other data lakes, where data is always appended to the dataset and never updated.
So we also keep a history of the data ingested and the changes made against the dataset.

 

Query Service works as a simple query layer on the data lake, so we can see all the data. However, CJA and RTCDP intelligently pick the latest data in record datasets, based on the _id.
If the idea is for the lookup dataset to contain only a single ingestion, with no appended duplicates, it is always advised to drop the dataset and ingest the data again.
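
As a rough mental model of the behavior described above, here is a toy sketch of an append-only store with two ways of reading it. It is not how AEP is implemented: the names `data_lake`, `ingest`, `query_all`, and `resolve_latest` are hypothetical, and "latest" is assumed to mean "ingested last".

```python
from typing import Dict, List

# Hypothetical stand-in for a record dataset: batches are only ever appended.
data_lake: List[dict] = []


def ingest(batch: List[dict]) -> None:
    """Append a batch; existing rows are never modified or removed."""
    data_lake.extend(batch)


def query_all(record_id: int) -> List[dict]:
    """Plain query layer: returns every stored row for the given _id."""
    return [row for row in data_lake if row["_id"] == record_id]


def resolve_latest() -> Dict[int, dict]:
    """Lookup-style view: a later row replaces an earlier one with the same _id."""
    latest: Dict[int, dict] = {}
    for row in data_lake:
        latest[row["_id"]] = row
    return latest


# The two uploads from the original question.
ingest([{"_id": 123, "name": "book1", "author": "john"}])
ingest([{"_id": 123, "name": "book1", "author": "paul"}])

all_rows = query_all(123)
lookup_view = resolve_latest()
print(len(all_rows))
print(lookup_view[123]["author"])
```

In this model the raw query sees both uploads, while the lookup-style view only exposes the most recently ingested row per `_id`, which matches what the thread reports for Queries versus CJA.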