SOLVED

Difference in data lake and Profile store size


Level 5

I have seen that in most cases the data lake size is smaller than the Profile store size, and in very few cases the data lake size is larger.

Based on my understanding, one reason the data lake is comparatively smaller is that it uses compression to store the data, so its size is smaller than that of the Profile store.

Also, if a dataset is enabled for Profile long after its data has already been ingested, very few profiles get ingested into the Profile store, which makes the data lake larger than the Profile store.
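To make the second case concrete, here is a minimal sketch of the idea, not of how the platform is implemented. It assumes that only batches ingested at or after the moment the dataset is enabled for Profile reach the Profile store. The function `split_by_enablement`, the batch fields, and the numbers are all hypothetical.

```python
def split_by_enablement(batches, enabled_at):
    """Return (lake_rows, profile_rows) for a list of ingested batches.

    Assumption for illustration only: every batch lands in the data lake,
    but only batches ingested at or after `enabled_at` reach the Profile store.
    """
    lake_rows = sum(b["rows"] for b in batches)
    profile_rows = sum(b["rows"] for b in batches if b["ingested_at"] >= enabled_at)
    return lake_rows, profile_rows


# Hypothetical batches: most of the data arrived before enablement (time 4).
batches = [
    {"ingested_at": 1, "rows": 500},
    {"ingested_at": 2, "rows": 300},
    {"ingested_at": 5, "rows": 50},
]

print(split_by_enablement(batches, enabled_at=4))  # (850, 50)
```

With these made-up numbers the data lake holds 850 rows while the Profile store holds only 50, which is the situation where the data lake ends up larger.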

 

Apart from the two reasons above, what other explanations are there for this size variance between the data lake and the Profile store?

I have attached a screenshot for reference.


1 Accepted Solution


Correct answer by
Level 5

I got this response from the Adobe support team:

 

- Data Lake: Stores data in compressed, columnar formats that are highly optimized for analytics, resulting in a smaller on-disk size.
- Profile Storage: Data is indexed for fast lookups, less compressed, richer in metadata, and ready for real-time activation, which means a larger footprint.
- Profile Storage: Merges identities and keeps one record per stitched profile, sometimes ingesting events or attributes from various sources. This can inflate the size if the identity graph is large or complex.
- Profile Storage: The union schema retains all fields ever enabled for Profile, even if they are deprecated later.
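The first two points can be illustrated with a small sketch. This is only a toy comparison under stated assumptions, not the actual storage format of either system: "row storage" is modeled as one JSON document per record with field names repeated, and "columnar compressed storage" as one zlib-compressed JSON array per field. The helper names and the sample records are hypothetical.

```python
import json
import zlib


def row_store_size(records):
    """Bytes used when each record is stored on its own, field names included."""
    return sum(len(json.dumps(r).encode("utf-8")) for r in records)


def columnar_compressed_size(records):
    """Bytes used when values are grouped per column and each column is zlib-compressed."""
    columns = {}
    for r in records:
        for key, value in r.items():
            columns.setdefault(key, []).append(value)
    return sum(
        len(zlib.compress(json.dumps(values).encode("utf-8")))
        for values in columns.values()
    )


# Hypothetical records with repetitive values, as is common in attribute data.
records = [
    {"id": i, "country": "US" if i % 2 else "IN", "status": "active"}
    for i in range(1000)
]

print("row-oriented bytes:       ", row_store_size(records))
print("columnar compressed bytes:", columnar_compressed_size(records))
```

Grouping similar values together lets the compressor remove most of the repetition, while the row-oriented layout pays for the field names and values again in every record.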


6 Replies


Level 4

Hi @Pradeep-Jaiswal ,

One reason I can think of is that in the data lake, the data is stored in its raw format, containing no structural (schema) information, no metadata, and possibly fewer indexes. In the Profile store, by contrast, the data is accompanied by schema details, metadata, and efficient indexing (for faster access), which makes it heavier.

 

Thanks!


Level 4

The data lake stores data predominantly in Parquet format, whereas the Profile store keeps data in a columnar database (possibly Cosmos DB). In addition, the identity graph is stored in a graph database. This could be the reason for the higher storage size in the Profile store compared with the data lake.


Administrator

Hi @Pradeep-Jaiswal,

Were you able to resolve this query with the help of the provided solutions, or do you still need further assistance? Please let us know. If any of the answers were helpful in moving you closer to a resolution, even partially, we encourage you to mark the one that helped the most as the 'Correct Reply.'

Thank you!



Sukrity Wadhwa


Administrator

Thanks @Pradeep-Jaiswal for sharing the update!



Sukrity Wadhwa


Level 6

@Pradeep-Jaiswal @itsMeTechy @Abie 

One more thing that can cause a size difference between Profile and the data lake: in the data lake, especially for Individual Record datasets, the same record may be updated multiple times from the source system, and with upserts each ingestion is retained in the data lake, whereas Profile keeps only the latest snapshot of the record.
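A minimal sketch of that difference, assuming an append-only list stands in for the data lake and a dictionary keyed by record ID stands in for Profile (both are simplifications for illustration, and the function name and sample rows are hypothetical):

```python
def lake_and_profile_counts(ingested):
    """Return (rows kept in the lake, records kept in the profile view)."""
    lake = []      # append-only: every ingested row is retained
    profile = {}   # keyed by record id: a later row replaces the earlier one
    for row in ingested:
        lake.append(row)
        profile[row["id"]] = row
    return len(lake), len(profile)


# Hypothetical feed where record "a" is sent twice by the source system.
ingested = [
    {"id": "a", "email": "a@old.example"},
    {"id": "b", "email": "b@example.com"},
    {"id": "a", "email": "a@new.example"},
]

print(lake_and_profile_counts(ingested))  # (3, 2)
```

The more often the source system resends or updates the same records, the further the two counts drift apart.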