SOLVED

Discrepancy Between Clickstream Raw Data Feed and Workspace

Level 1

Hi!

 

Recently my team and I have been attempting to automate some of our reporting by implementing a pipeline that pulls the clickstream data feed from our data warehouse. This is to supplement our general usage of Workspace for ad-hoc analysis. We have noticed a relatively high discrepancy between the two, with metrics like Visitors and Views being nearly 20% off from the values seen in Workspace.

 

We have been using these docs to calculate the metrics:

 

Calculate metrics | Adobe Analytics

 

Is there any sort of unlisted preprocessing that occurs between Workspace and the warehouse that would cause this? Or is there any other factor that may be causing the discrepancies?

 

Thank you! 

1 Accepted Solution

Correct answer by
Employee Advisor

@ben_5 The data feeds that you receive are not filtered for bot hits or other excluded traffic, so make sure you apply these filters yourself:

Hits normally excluded from Adobe Analytics are included in data feeds. Rows with exclude_hit > 0 are excluded hits, so keep only rows where exclude_hit = 0 in queries on raw data. Data from data sources is also included in data feeds. If you want to exclude data sources, exclude all rows where hit_source is 5, 7, 8, or 9.

Also, a live data feed is delivered as hourly or daily files. When you compare that data to Workspace over longer date ranges, the same visitor can appear in more than one file, so you can see duplicate visitors in the data feeds. Workspace deduplicates these visitors.
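As a rough illustration of the filters and the deduplication described above, here is a minimal Python sketch. It assumes each feed row has already been parsed into a dict of strings, and that visitors are identified by the post_visid_high and post_visid_low columns. The sample rows and the helper names keep_hit and unique_visitors are made up for the example.

```python
# hit_source values that mark data-source rows, per the answer above.
DATA_SOURCE_HIT_SOURCES = {"5", "7", "8", "9"}


def keep_hit(row):
    """True for hits that should be counted.

    Assumption: values arrive as strings, as they would from a delimited file.
    """
    return row["exclude_hit"] == "0" and row["hit_source"] not in DATA_SOURCE_HIT_SOURCES


def unique_visitors(rows):
    """Count distinct visitors among the kept hits.

    Assumption: a visitor is the (post_visid_high, post_visid_low) pair.
    """
    return len({(r["post_visid_high"], r["post_visid_low"]) for r in rows if keep_hit(r)})


def hit(high, low, exclude_hit="0", hit_source="1"):
    """Build a made-up sample row for the demonstration."""
    return {"post_visid_high": high, "post_visid_low": low,
            "exclude_hit": exclude_hit, "hit_source": hit_source}


# Two hypothetical hourly files. Visitor ("1", "1") appears in both.
hour_1 = [hit("1", "1"), hit("2", "2", exclude_hit="4")]
hour_2 = [hit("1", "1"), hit("3", "3", hit_source="7"), hit("4", "4")]

# Summing per-file counts double-counts the returning visitor.
summed = unique_visitors(hour_1) + unique_visitors(hour_2)

# Counting over the combined rows deduplicates across files.
combined = unique_visitors(hour_1 + hour_2)

print(summed, combined)
```

The point of the sketch is the last two lines: visitor counts have to be computed over the whole date range at once, not added up file by file.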

7 Replies

Community Advisor

Your Raw Data feeds will include all the bot and internal-IP traffic that is normally filtered out. You must filter this out of the raw data yourself (whereas it is removed from Workspace and Data Warehouse for you automatically).

 

The field you need to look for in the Raw Data Feed is exclude_hit

 

https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/data...

 

Level 2

I understand you said we have to do any exclusions manually.

Let's say the exclusion is at the Visit level, and it's by country along with a dimension called Last Touch Channel Detail (see link). How would you go about doing this exclusion in our Adobe Data Feed (ADF)?

I thought of using the Data Warehouse API to set up a request and have the destination set to somewhere in our Azure Storage.

But since there is no notion of visit_id in the output of the Data Warehouse API request, how would I be able to tie it back to our Adobe Data Feed, which is already processed and sitting in our Azure database?

 

Community Advisor

Raw data is all stored at the hit level. Even when you stitch the data together by Visit or Visitor, the exclusions I am referencing are the hits flagged with "exclude_hit".

 

Normally, Raw Data like this would be ingested into a "Landing Zone" DB of some sort. Then it would be run through an ETL process where rows flagged by exclude_hit aren't carried over to your clean DB, or maybe they are put into a separate "excluded hits" table, if for some reason you need access to all the hits that wouldn't be included in Adobe.
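A minimal sketch of that kind of ETL step, using Python's built-in sqlite3 module with an in-memory database purely for illustration. The table names landing_hits, clean_hits, and excluded_hits, and the hit_id and pagename columns, are hypothetical. Only exclude_hit comes from the data feed discussion above.

```python
import sqlite3


def run_etl(conn):
    """Split landing-zone rows into a clean table and an excluded-hits table."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS clean_hits;
        DROP TABLE IF EXISTS excluded_hits;
        -- Hits Adobe would report on.
        CREATE TABLE clean_hits AS
            SELECT * FROM landing_hits WHERE exclude_hit = 0;
        -- Hits Adobe would drop, kept aside in case they are needed later.
        CREATE TABLE excluded_hits AS
            SELECT * FROM landing_hits WHERE exclude_hit > 0;
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE landing_hits (hit_id INTEGER, exclude_hit INTEGER, pagename TEXT)"
)
conn.executemany(
    "INSERT INTO landing_hits VALUES (?, ?, ?)",
    [(1, 0, "home"), (2, 4, "home"), (3, 0, "cart")],  # made-up sample rows
)
run_etl(conn)

clean_count = conn.execute("SELECT COUNT(*) FROM clean_hits").fetchone()[0]
excluded_count = conn.execute("SELECT COUNT(*) FROM excluded_hits").fetchone()[0]
print(clean_count, excluded_count)
```

Keeping the excluded rows in their own table means reporting queries only ever touch clean_hits, while the raw picture stays available for debugging.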

 

Now, if you are running a query to pull back data and want to leave out certain Visits or Visitors, this logic would be run on your final "clean" database and built into your SQL query logic. The data would still be available for other queries.

 

If you are using the API (and not raw data feeds), I believe that excluded hits are already removed, since the API is what drives things like Workspace and Data Warehouse, and none of those include those hits.

 

 

Not knowing what your ETL process has done with the raw data makes this hard (plus I am not a DBA, so while I know SQL, I don't use it every day). But in "pseudo terms", if I wanted to exclude Visits with Country "X" and Last Touch Channel "Y", I would query those two fields, identify the VisID and Visit Number combinations, and then exclude those Visitors on those particular visits to pull back data that didn't include them.
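To make those "pseudo terms" a bit more concrete, here is a small Python sketch of the two-pass idea: first collect the visit keys that match, then drop every hit belonging to those visits. It assumes a visit is identified by post_visid_high, post_visid_low, and visit_num, and that both conditions must hold on the same hit. The country and last_touch_channel_detail keys are placeholder names, not actual feed column names, and the sample rows are made up.

```python
def visit_key(row):
    """Visitor ID plus visit number (assumed to identify one visit)."""
    return (row["post_visid_high"], row["post_visid_low"], row["visit_num"])


def exclude_visits(rows, country, channel_detail):
    """Drop every hit of any visit containing a hit that matches both values."""
    # Pass 1: find the visits to exclude.
    flagged = {
        visit_key(r)
        for r in rows
        if r["country"] == country and r["last_touch_channel_detail"] == channel_detail
    }
    # Pass 2: keep only hits from visits that were not flagged.
    return [r for r in rows if visit_key(r) not in flagged]


def hit(high, low, visit_num, country, channel_detail):
    """Build a made-up sample row for the demonstration."""
    return {"post_visid_high": high, "post_visid_low": low, "visit_num": visit_num,
            "country": country, "last_touch_channel_detail": channel_detail}


rows = [
    hit("1", "1", "1", "X", "Y"),  # matches, so visit 1 of this visitor is flagged
    hit("1", "1", "1", "X", "Z"),  # same visit, removed along with the hit above
    hit("1", "1", "2", "X", "Z"),  # same visitor, different visit: kept
    hit("2", "2", "1", "W", "Y"),  # different visitor: kept
]

kept = exclude_visits(rows, country="X", channel_detail="Y")
print(len(kept))
```

The same logic translates to SQL as a subquery (or CTE) that selects the flagged visit keys, followed by an anti-join against the clean hits table.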

Community Advisor

See my comment here for Workspace.

 

https://experienceleaguecommunities.adobe.com/t5/adobe-analytics-questions/data-mismatch-between-wor...

 

Basically, Workspace has some attribution settings that can differ from Adobe's regular reports. Linear is often the one to try when trying to match regular reporting. Hope this gets you closer.

 

GLTU

Level 1

Ah, thank you so much! 

That would definitely do it! We are beginning to implement those exclusions and should know if it works soon. I really appreciate the thoughtful response.