Level 3
February 26, 2025
Solved

Data Feeds Count Page Views, Visits, and other Metrics Using R Studio

Hello!
 
To keep it simple for now, I want to compare counts of page views first.
Here is how I loaded the data. The "\" escape character seemed to be messing up the tibble, so I handled it by setting escape_backslash = TRUE.
 

 

library(tidyverse)

# read the tsv hit data
hit_data_df <- read_delim(
  "hit_df.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE,
  na = c("", "NA")
)

# read the tsv headers
headers <- read_delim(
  "column_headers.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE
)

# insert headers in hit_data_df
col_names <- as.character(headers[1, ])
colnames(hit_data_df) <- col_names

 

 

Based on the definition for page view count in the Data Feeds 'Calculate metrics' documentation (https://experienceleague.adobe.com/en/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-calculate): 'Count the number of rows where a value is in post_pagename or post_page_url'.
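For reference, that definition translates to a row count where either column is non-NA. A minimal sketch, using a toy tibble with hypothetical values standing in for the real feed:

```r
library(dplyr)

# Toy stand-in for the loaded feed; values are hypothetical.
hit_data_df <- tibble(
  post_pagename = c("Home", NA, NA),
  post_page_url = c(NA, "https://example.com/", NA)
)

# "Count the number of rows where a value is in post_pagename or
# post_page_url": a row counts once if either column is non-NA.
page_view_rows <- hit_data_df %>%
  filter(!is.na(post_pagename) | !is.na(post_page_url))

nrow(page_view_rows)  # rows 1 and 2 qualify -> 2
```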
 
The line below matched the number of occurrences to within 99% in AA Workspace for the same date range (1 hour):

 

page_name_pv <- hit_data_df %>% select(post_pagename)

 

 

Looking back at the definition, 'where a value is in post_pagename', I decided to remove NAs, thinking it would then match the page view counts:

 

page_name_pv_na_om <- hit_data_df %>% select(post_pagename) %>% na.omit()

 

 
...but that showed only 20% of the page views I see in AA Workspace for the same hour.
 
Furthermore, it seems I still need to filter for exclude_hit = 0, which will lead to even fewer counts?
 
This looks a bit counterintuitive, as it kept the zeroes; what I did here is include "0", right?

 

hit_data_df_ih0 <- hit_data_df %>% filter(exclude_hit == "0")

 

 
The line below showed 'Y' in all the rows I saw; I'm not sure if there were other values.
 

 

hit_data_df_xh0 <- hit_data_df %>% filter(exclude_hit != "0")
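One way to answer the question of which values actually occur is to tally exclude_hit directly. A sketch on toy data with hypothetical values (real feeds use small integer codes, so a stray 'Y' would hint at misaligned columns):

```r
library(dplyr)

# Toy stand-in for the loaded feed; values are hypothetical.
hit_data_df <- tibble(
  exclude_hit = c("0", "0", "1", "Y", "0")
)

# Tally every distinct exclude_hit value. Per the data feed docs these
# should be small integer codes; anything else (like "Y") suggests a
# column-alignment problem in the parsed file.
value_counts <- hit_data_df %>% count(exclude_hit, sort = TRUE)
print(value_counts)
```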

 

 
It would be great to know what I am doing wrong, or whether the data is not good (the reason I was asked to compare it).
 
Thanks!
 
R
Best answer by Harveer_SinghGi1

Could it be that the data frame doesn't show the correct order once in R because 650MB is too much data for R Studio to handle? Thanks again for the help!


Hi @rafael_sahagun ,

While reading delimited data, whenever you see unexpected values flowing into columns where preset values are expected, it is almost always due to the delimiter being present in one of the column values. In your case it seems the tab delimiter (\t) is recorded in one of the values returned in the data feed.

I'd suggest you narrow down to the row where you first see the unexpected values; the row above it should be the one containing a tab delimiter in one of its values. After identifying those rows, you can figure out how to drop them and keep the remaining ones.
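The suggestion above can be sketched by counting tab-separated fields per raw line and flagging lines whose count differs from the header row (shown here on an in-memory example rather than the real files):

```r
# Header with 3 fields, and two data lines; the second has an embedded
# tab inside the page name, giving it 4 fields instead of 3.
header_line <- "post_pagename\tpost_page_url\texclude_hit"
data_lines <- c(
  "Home\thttps://example.com/\t0",
  "Bad\tname\thttps://example.com/bad\t0"
)

expected_fields <- length(strsplit(header_line, "\t", fixed = TRUE)[[1]])
fields_per_line <- lengths(strsplit(data_lines, "\t", fixed = TRUE))

# Rows whose field count differs from the header are the malformed ones.
bad_rows <- which(fields_per_line != expected_fields)
print(bad_rows)  # line 2 is malformed
```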

Cheers!

3 replies

Jennifer_Dungan
Community Advisor and Adobe Champion
February 26, 2025

How much delay do you have on your raw data feeds? If you are running these hourly and you don't have any delay, there is a good chance that you are missing data that is still processing on the Adobe server. For safety, we use the maximum delay of 120 minutes (2 hours) on our hourly raw data exports to ensure that the data is fully available.

 

Before I got involved in that initiative, our Data Lake team didn't have any delay and they were constantly short on data, not realizing that there is data processing latency in Adobe that needs to complete.

Level 3
February 26, 2025

Hi @jennifer_dungan, the 1-hour feed file is from mid-February, so more than a week has gone by. Thanks!

Level 3
February 27, 2025

Hmm so this is a file pulled from older data... that is interesting... 

 

While I don't process our data feeds myself, I work with the people that do... and our data is very very close... with the exclude hit filtering and all...

 

I am not sure why you are getting such a huge discrepancy....

 

 

Just to confirm, the file from mid-Feb, that was pulled recently? You aren't talking about a test file that was pulled in mid-Feb and you are only processing it now?


But great point, I'll double confirm.

Harveer_SinghGi1
Community Advisor
February 26, 2025

Hi @rafael_sahagun ,

You should check for data processing delays, as @jennifer_dungan suggested. Also, is it mobile app data? That could have timestamped hits arriving late to the report suite, which would update the numbers in the AA UI; but the data feed export would already be done, so it wouldn't include such hits.

If you are still seeing the discrepancy for data older than 2-3 hours, then give this a try:

hit_date_pv <- nrow(hit_data_df[hit_data_df$exclude_hit == 0 && !(hit_data_df$hit_source %in% c(5,7,8,9)) && (!is.na(hit_data_df$post_pagename) || !is.na(hit_data_df$post_page_url)),])

Cheers!

Level 3
February 26, 2025

Thanks @harveer_singhgi1 !

 

As confirmed to Jenn, this data is almost 2 weeks old.

 

I applied what you just shared and got:

Error in hit_data_df$exclude_hit == 0 && !(hit_data_df$hit_source %in% c(5, 7, 8, :
'length = 253048' in coercion to 'logical(1)'

Wondering why.
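The error most likely comes from && and ||, which in R are scalar operators and, in recent R versions, raise an error when given vectors longer than 1; the element-wise forms are & and |. A sketch of the same filter with vectorized operators, on toy data with hypothetical values:

```r
library(dplyr)

# Toy stand-in for the loaded feed; values are hypothetical.
hit_data_df <- tibble(
  exclude_hit   = c(0, 0, 1, 0),
  hit_source    = c(1, 5, 1, 1),
  post_pagename = c("Home", "Home", NA, NA),
  post_page_url = c(NA, NA, NA, "https://example.com/")
)

# Same logic as the suggested filter, but with element-wise & and |
# instead of the scalar && and ||.
hit_date_pv <- nrow(hit_data_df[
  hit_data_df$exclude_hit == 0 &
    !(hit_data_df$hit_source %in% c(5, 7, 8, 9)) &
    (!is.na(hit_data_df$post_pagename) | !is.na(hit_data_df$post_page_url)),
])
print(hit_date_pv)  # rows 1 and 4 qualify -> 2
```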

 

Let's see if I understood what you shared, explained in English:

Keep rows where exclude_hit is 0

Remove rows where hit_source is 5, 7, 8, or 9

Remove rows where both post_pagename and post_page_url are NA

 

If I understood correctly, then, as mentioned before, only removing the NAs showed just 20% of the page views I see in AA Workspace for the same hour.

 

Should I leave the NAs, remove the hit_source values 5, 7, 8, or 9, and see what happens? Or must the NAs be removed per the definition 'Count the number of rows where a value is in post_pagename or post_page_url'?

 

Thanks again!


R

 

 

Jennifer_Dungan
Community Advisor and Adobe Champion
February 27, 2025

In our process, we don't try to filter based on a list of specific exclude_hit values... we just include anything that is exclude_hit = 0 (i.e. no exclusion)

 

While many of the exclude_hit values are no longer in use, it's a more robust system to simply include "0" and exclude anything else... if an old value is re-purposed, or a new value is added, you don't have to change any logic.

Sukrity_Wadhwa
Community Manager
April 2, 2025

Hi @rafael_sahagun,

Were you able to resolve this query with the help of the provided solutions, or do you still need further assistance? Please let us know. If any of the answers were helpful in moving you closer to a resolution, even partially, we encourage you to mark the one that helped the most as the 'Correct Reply.'
Thank you!

Sukrity Wadhwa
Level 3
April 4, 2025

Thanks for reminding me @sukrity_wadhwa  and for the help @harveer_singhgi1 & @jennifer_dungan !