Build Data pipeline with Adobe data feeds

Solved

  • May 22, 2025
  • 2 replies
  • 666 views

Hi everyone,

I would like to check if someone has come across this type of request for data analysis with Adobe data feeds. To achieve this, I am trying to build a data pipeline and have a medallion architecture in mind. The goal is to have one table at the visit level plus aggregated tables as per business requirements. Please share your thoughts, challenges, or experience if you have any regarding this.

Thanks

Best answer by pradnya_balvir (accepted solution shown in the replies below)

2 replies

pradnya_balvir | Community Advisor | Accepted solution
May 22, 2025

Hi @sasikalaes,

 

Key Design Considerations:

  1. Data Ingestion
    • Format: Adobe data feeds are TSV files with thousands of columns.
    • Delivery: Often daily, partitioned by date/hour.
    • Tools: Use Spark, Databricks, or Snowflake for scalable parsing and ingestion (see the Bronze-layer ingestion sketch after the tools list below).
  2. Visit-Level Aggregation
    • Adobe doesn't explicitly give you visits in the data feed, so you must:
      • Use visit_num, visit_start_time_gmt, and post_visid_high/low to group hits.
      • Apply sessionization logic (handling visit timeouts, cross-day visits); a PySpark sketch follows this list.
  3. Identity Resolution
    • post_visid_high/low or mcvisid/mid fields are used for the visitor ID.
    • Cross-device stitching is not out-of-the-box; consider integrating with ECID/CRM IDs if available.
  4. Medallion Architecture
    • Bronze: Raw ingestion + minimal parsing (e.g., data types, partitioning).
    • Silver: Normalize fields, resolve sessions, de-duplicate hits.
    • Gold: Create dimension tables and aggregates for metrics like conversion rate and funnel analysis.
  5. Aggregation Examples (a sample aggregate follows the sessionization sketch below)
    • Sessions by traffic source.
    • Page views by product category.
    • Time spent on site by user cohort.
    • Custom attribution models for conversions.
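To make the visit-level grouping concrete, here is a minimal PySpark sketch of the Silver-to-Gold step. The table and column names (silver.adobe_hits, gold.visits, the selected metrics) are placeholders, not anything Adobe ships; only the data feed field names (post_visid_high/low, visit_num, visit_start_time_gmt, date_time, post_pagename, exclude_hit) come from the feed itself.

```python
# Minimal sketch of the Silver-to-Gold step: group Adobe data feed hits into
# visits. Table and column names are illustrative; adjust to your own schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumes exclude_hit was already cast to an integer in the Silver layer.
hits = spark.table("silver.adobe_hits").filter(F.col("exclude_hit") == 0)

# A visit is identified by the visitor ID pair (post_visid_high, post_visid_low)
# plus visit_num; visit_start_time_gmt guards against visit_num reuse.
visit_key = ["post_visid_high", "post_visid_low", "visit_num", "visit_start_time_gmt"]

visits = (
    hits.groupBy(*visit_key)
        .agg(
            F.min("date_time").alias("visit_start"),
            F.max("date_time").alias("visit_end"),
            F.count("*").alias("hit_count"),
            F.countDistinct("post_pagename").alias("unique_pages"),
        )
)

# Write the Gold visit-level table (Delta assumed here).
visits.write.format("delta").mode("overwrite").saveAsTable("gold.visits")
```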

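Once that visit-level table exists, the Gold aggregates listed above are mostly simple group-bys. A hedged example for "sessions by traffic source", where traffic_source is a column you would have derived upstream (for example from post_referrer or your marketing channel logic); it is not a raw feed field.

```python
# Illustrative Gold aggregate: sessions by traffic source per day.
# "gold.visits" and "traffic_source" are placeholders from the sketch above.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

visits = spark.table("gold.visits")

sessions_by_source = (
    visits.groupBy(F.to_date("visit_start").alias("visit_date"), "traffic_source")
          .agg(F.count("*").alias("sessions"))
)

sessions_by_source.write.format("delta").mode("overwrite") \
    .saveAsTable("gold.sessions_by_traffic_source")
```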
Suggested Tools:

 

  • Data Lakehouse: Databricks (Delta Lake), Snowflake, BigQuery

  • Orchestration: Airflow, Azure Data Factory, dbt

  • Storage: S3 / ADLS Gen2 (Bronze/Silver/Gold folders)

  • Analytics: Power BI, Tableau, Looker

  • Schema Evolution: Apache Iceberg or Delta for handling schema changes
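To tie the ingestion and tooling suggestions together, here is a minimal Bronze-layer sketch. It assumes the usual data feed delivery where the hit files have no header row and the column names arrive in a separate column_headers lookup file; all paths, bucket names, and table names below are placeholders.

```python
# Minimal Bronze-layer sketch: land one day of raw Adobe data feed TSVs in Delta.
# Paths, bucket names, and table names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Data feed hit files carry no header row; names come from the delivered
# column_headers lookup file (a single tab-separated line).
with open("/dbfs/feeds/column_headers.tsv") as f:
    columns = f.read().strip().split("\t")

raw = (
    spark.read
         .option("sep", "\t")
         .csv("s3://my-bucket/adobe_feed/2025-05-22/*.tsv.gz")
         .toDF(*columns)
)

# Keep the data as-is in Bronze; just stamp the feed date for partitioning.
bronze = raw.withColumn("feed_date", F.lit("2025-05-22"))

(bronze.write
       .format("delta")
       .mode("append")
       .partitionBy("feed_date")
       .saveAsTable("bronze.adobe_hits"))
```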

 

Jennifer_Dungan | Community Advisor and Adobe Champion
May 24, 2025

One more thing to keep in mind.

 

Raw data feeds have every row of data collected... including rows that have been excluded (bots, internal traffic, malformed data, etc.).

 

When processing your raw data, don't forget to check the exclude_hit column and make sure that you don't include these rows, or your data will be inflated.

 

 

Also, make sure you are using the "post" version of the data wherever possible... this is the post-processed version of the data (after your processing rules, VISTA rules, etc. have been applied).
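A small sketch combining both of these points, dropping excluded rows and keeping the post_ columns; the table names and the selected column list are illustrative, not a fixed recipe.

```python
# Illustrative PySpark cleanup reflecting the two points above: drop excluded
# hits and keep the post_ (post-processed) versions of the columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

hits = spark.table("bronze.adobe_hits")

clean_hits = (
    hits
    # exclude_hit = 0 means the hit counts in reporting; non-zero values mark
    # rows Adobe excluded (bots, internal traffic, malformed data, etc.).
    .filter(F.col("exclude_hit") == "0")
    # Prefer post_ columns: they reflect processing rules, VISTA rules, and
    # other server-side processing applied after collection.
    .select(
        "post_visid_high",
        "post_visid_low",
        "visit_num",
        "visit_start_time_gmt",
        "date_time",
        "post_pagename",
        "post_page_url",
    )
)

clean_hits.write.format("delta").mode("overwrite").saveAsTable("silver.adobe_hits")
```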