Data Feeds Count Page Views, Visits, and other Metrics Using R Studio
Level 3
February 26, 2025
Solved

Hello!
 
To keep it simple for now, I want to compare page view counts first.
 Here is how I loaded the data. The "\" escape character seemed to be messing up the tibble, so I removed it using escape_backslash = TRUE:
 

 

library(tidyverse)

# read tsv hit data
hit_data_df <- read_delim(
  "hit_df.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE,
  na = c("", "NA")
)

# read tsv headers
headers <- read_delim(
  "column_headers.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE
)

# insert headers in hit_data_df
col_names <- as.character(headers[1, ])
colnames(hit_data_df) <- col_names

 

 

Based on the definition of page view count in the Data Feeds calculate metrics documentation (https://experienceleague.adobe.com/en/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-calculate): 'Count the number of rows where a value is in post_pagename or post_page_url'
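That definition can be sketched directly in dplyr. A minimal sketch on a toy tibble; the values below are hypothetical, only the two column names come from the docs:

```r
library(tidyverse)

# Toy data: one hit with a page name, one with a URL only, one with neither
hits <- tibble(
  post_pagename = c("home", NA, NA),
  post_page_url = c(NA, "https://example.com/cart", NA)
)

# 'Count the number of rows where a value is in post_pagename or post_page_url'
pv <- hits %>%
  filter(!is.na(post_pagename) | !is.na(post_page_url)) %>%
  nrow()
pv  # 2
```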
 
The line below matched the number of occurrences 99% in AA Workspace for the same date range (1 hour):

 

page_name_pv <- hit_data_df %>% select(post_pagename)

 

 

Looking back at the definition, 'where a value is in post_pagename', I decided to remove NAs, thinking it would match the page view counts:

 

page_name_pv_na_om <- hit_data_df %>% select(post_pagename) %>% na.omit()

 

 
...but that showed only 20% of the page views I see in AA Workspace for the same hour.
 
Furthermore, it seems I still need to filter exclude_hit = 0, which will lead to even fewer counts?
 
This looks a bit counterintuitive, as it kept the zeroes; what I did was to include "0", right?

 

hit_data_df_ih0 <- hit_data_df %>% filter(exclude_hit == "0")

 

 
The line below would show 'Y' in all the rows I saw; I'm not sure whether there were other values.
 

 

hit_data_df_xh0 <- hit_data_df %>% filter(exclude_hit != "0")

 

 
It would be great to know what I am doing wrong, or whether the data is not good (which is the reason I was asked to compare).
 
Thanks!
 
R
Best answer by Harveer_SinghGi1

Could it be that the data frame doesn't show the correct order once in R because 650 MB is too much data for RStudio to handle? Thanks again for the help!


Hi @rafael_sahagun ,

When reading delimited data, whenever you see unexpected values flowing into columns where preset values are expected, it is almost always due to the delimiter being present in one of the column values. In your case, it seems the tab delimiter (\t) is recorded in one of the values returned in the data feed.

I'd suggest you narrow down to the row where you start seeing the unexpected values; the row above should be the one containing the tab delimiter in one of its values. After identifying those rows, you can figure out how to drop them and keep the remaining ones.
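One way to sketch that narrowing-down step is with base R's count.fields, which counts tab-separated fields per line; a line whose value contains an embedded tab splits into one extra field. The toy lines below are hypothetical:

```r
# Simulated feed: line 3 has an embedded tab inside a value ("Cart\tPage")
lines <- c("col1\tcol2\tcol3",
           "1\tHome Page\thttps://example.com/",
           "2\tCart\tPage\thttps://example.com/cart")

# Field count per line; rows differing from the header width are suspects
n_fields <- count.fields(textConnection(lines), sep = "\t", quote = "")
bad_rows <- which(n_fields != n_fields[1])
bad_rows  # 3
```

The same check against the real file would be count.fields("hit_df.tsv", sep = "\t", quote = ""), compared against the width of the headers file.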

Cheers!

3 replies

Jennifer_Dungan
Community Advisor and Adobe Champion
February 26, 2025

How much delay do you have on your raw data feeds? If you are running them hourly and you don't have any delay, there is a good chance you are missing data that is still processing on the Adobe server. For safety, we use the maximum delay of 120 minutes (2 hours) on our hourly raw data exports to ensure the data is fully available.

 

Before I got involved in that initiative, our Data Lake team didn't have any delay and they were constantly short on data, not realizing that there is data processing latency in Adobe that needs to complete.

Level 3
February 26, 2025

Hi @jennifer_dungan, the 1-hour feed file is from mid-February, so more than a week has gone by. Thanks!

Jennifer_Dungan
Community Advisor and Adobe Champion
February 27, 2025

Hmm so this is a file pulled from older data... that is interesting... 

 

While I don't process our data feeds myself, I work with the people that do... and our data is very very close... with the exclude hit filtering and all...

 

I am not sure why you are getting such a huge discrepancy....

 

 

Just to confirm, the file from mid-Feb, that was pulled recently? You aren't talking about a test file that was pulled in mid-Feb and you are only processing it now?

Harveer_SinghGi1
Community Advisor and Adobe Champion
February 26, 2025

Hi @rafael_sahagun ,

You should check for data processing delays as @jennifer_dungan suggested. Also, is it mobile app data? That could have timestamped hits arriving late to the report suite, which will update the numbers in the AA UI; but the data feed export is already done, so it won't contain such hits.

If you are still seeing the discrepancy for data older than 2-3 hours, then give this a try:

hit_date_pv <- nrow(hit_data_df[hit_data_df$exclude_hit == 0 && !(hit_data_df$hit_source %in% c(5,7,8,9)) && (!is.na(hit_data_df$post_pagename) || !is.na(hit_data_df$post_page_url)),])

Cheers!

Level 3
February 26, 2025

Thanks @harveer_singhgi1 !

 

As confirmed to Jenn, this data is almost 2 weeks old.

 

I applied what you just shared and got:

Error in hit_data_df$exclude_hit == 0 && !(hit_data_df$hit_source %in% c(5, 7, 8, :
'length = 253048' in coercion to 'logical(1)'

Wondering why.
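For what it's worth, that error comes from && and ||: in R they are scalar operators, and recent R versions (4.3+) raise an error when they are given vectors longer than one, hence 'length = 253048' in coercion to 'logical(1)'. The element-wise operators & and | are what a row filter needs. A minimal sketch on toy data; the column values below are hypothetical:

```r
# Toy frame mirroring the feed columns used in the suggested filter
df <- data.frame(
  exclude_hit   = c(0, 0, 1, 0),
  hit_source    = c(1, 5, 1, 1),
  post_pagename = c("home", "cart", NA, NA),
  post_page_url = c(NA, NA, NA, "https://example.com/")
)

# Same logic, but with vectorized & and | so it evaluates row by row
hit_date_pv <- nrow(df[df$exclude_hit == 0 &
                       !(df$hit_source %in% c(5, 7, 8, 9)) &
                       (!is.na(df$post_pagename) | !is.na(df$post_page_url)), ])
hit_date_pv  # rows 1 and 4 qualify -> 2
```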

 

Let's see if I understood what you shared, explained in English:

Include any row where exclude_hit is 0

Remove any row where hit_source is 5, 7, 8, or 9

Remove NAs from post_pagename or post_page_url

 

If I understood well, then only removing the NAs, as mentioned before, showed only 20% of the page views I see in AA Workspace for the same hour.

 

Should I leave the NAs, remove the hit_source values 5, 7, 8, or 9, and see what happens? Or must the NAs be removed per the definition 'Count the number of rows where a value is in post_pagename or post_page_url'?

 

Thanks again!


R

 

 

Level 3
February 28, 2025

Thanks both @harveer_singhgi1 & @jennifer_dungan !

 

If I change escape_backslash = TRUE to FALSE, then the backslash doesn't get removed, and it results in a 96% match between AA UI page views and the row count of post_pagename (the smaller value is in Data Feeds).

Looking at the distinct post_pagename values, they make sense and apparently don't contain a single backslash, even though I didn't remove those.

accept_language (image below) is what prompted me to think I had to remove the backslash from the whole tibble/data frame, as it looked odd in that column. Could it be that only some dimensions need the backslash removed, or none at all?
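To answer the "only some dimensions?" question empirically, one could check per column which ones actually contain a literal backslash before deciding on escape_backslash. A sketch on a toy frame; the accept_language values are made up:

```r
# Toy frame: accept_language carries an escaped comma, post_pagename does not
df <- data.frame(
  accept_language = c("en-US\\,en;q=0.9", "fr-FR"),
  post_pagename   = c("home", "cart")
)

# TRUE for every column containing at least one literal backslash
has_backslash <- vapply(df, function(x) any(grepl("\\\\", x)), logical(1))
has_backslash  # accept_language TRUE, post_pagename FALSE
```

Run against the real hit_data_df, this would show whether the backslashes are confined to a few columns such as accept_language.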

 


Hi @harveer_singhgi1 ,

So, is that '\' character unusual to see in AA Data Feeds exports, like the one in the screenshot I shared?

Regardless of whether I set escape_backslash = TRUE or FALSE, I see values that seem to belong to other columns (not sure yet of the percentage). I wonder if that is a red flag, and whether the problem is in the way I'm getting the data into RStudio or in R itself...

Thanks a lot!

 

 

Sukrity_Wadhwa
Community Manager
April 2, 2025

Hi @rafael_sahagun,

Were you able to resolve this query with the help of the provided solutions, or do you still need further assistance? Please let us know. If any of the answers were helpful in moving you closer to a resolution, even partially, we encourage you to mark the one that helped the most as the 'Correct Reply.'
Thank you!

Sukrity Wadhwa
Level 3
April 4, 2025

Thanks for reminding me @sukrity_wadhwa  and for the help @harveer_singhgi1 & @jennifer_dungan !