Data Feeds: Count Page Views, Visits, and Other Metrics Using RStudio


Level 3
Hello!
 
To keep it simple for now, I want to compare counts of page views first. Here is how I loaded the data. The "\" escape character seemed to be messing up the tibble, so I removed it using escape_backslash = TRUE.
 

 

library(tidyverse)

# read the tab-delimited hit data
hit_data_df <- read_delim(
  "hit_df.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE,
  na = c("", "NA")
)

# read the column headers tsv
headers <- read_delim(
  "column_headers.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE
)


# insert headers in hit_data_df
col_names <- as.character(headers[1, ])
colnames(hit_data_df) <- col_names
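
For a quick sanity check that the headers line up with the data, something like this works:

# the header file and the hit data should agree on column count
stopifnot(length(col_names) == ncol(hit_data_df))
dim(hit_data_df)
head(col_names)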

 

 

Based on the definition of the page view count in Data feeds calculate metrics (https://experienceleague.adobe.com/en/docs/analytics/export/analytics-data-feed/data-feed-contents/d...): 'Count the number of rows where a value is in post_pagename or post_page_url'
 
The row count from the line below matched the occurrences in AA Workspace at about 99% for the same date range (1 hour).

 

page_name_pv <- hit_data_df %>% select(post_pagename)

 

 

Looking back at the definition, 'where a value is in post_pagename', I decided to remove the NAs, thinking it would match the page view counts.

 

page_name_pv_na_om <- hit_data_df %>% select(post_pagename) %>%
  na.omit()

 

 
...but that showed only 20% of the page views I see in AA Workspace for the same hour.

Furthermore, it seems I still need to filter to exclude_hit = 0, which will reduce the counts even further?

That filter looks a bit counterintuitive, as it keeps the zeroes; what I did below is include the "0" rows, right?

 

hit_data_df_ih0 <- hit_data_df %>% 
  filter(exclude_hit == "0")

 

 
The query below showed 'Y' in all the rows I looked at; not sure if there were other values.
 

 

hit_data_df_xh0 <- hit_data_df %>% 
  filter(exclude_hit != "0")

 

 
It would be great to know what I'm doing wrong, or whether the data itself is bad (the reason I was asked to compare).
 
Thanks!
 
R
14 Replies


Community Advisor and Adobe Champion

How much delay do you have on your raw data feeds? If you are running them hourly with no delay, there is a good chance that you are missing data that is still processing on the Adobe servers. For safety, we use the maximum delay of 120 minutes (2 hours) on our hourly raw data exports to ensure that the data is fully available.

 

Before I got involved in that initiative, our Data Lake team didn't have any delay, and they were constantly short on data, not realizing that there is data processing latency in Adobe that needs to complete.


Level 3

Hi @Jennifer_Dungan, the 1-hour feed file is from mid-February, so more than a week has gone by. Thanks!


Community Advisor and Adobe Champion

Hmm so this is a file pulled from older data... that is interesting... 

 

While I don't process our data feeds myself, I work with the people that do... and our data is very very close... with the exclude hit filtering and all...

 

I am not sure why you are getting such a huge discrepancy....

 

 

Just to confirm, the file from mid-Feb, that was pulled recently? You aren't talking about a test file that was pulled in mid-Feb and you are only processing it now?


Level 3

Thanks @Jennifer_Dungan 

The tsv is from Feb 13th, and occurrences match at a little above 99%, so that's partially good news, suggesting the data has settled by now. The issue is getting to the other metrics.


Level 3

But great point, I'll double confirm.


Community Advisor

Hi @Rafael_Sahagun ,

You should check for data processing delays as @Jennifer_Dungan suggested. Also, is it mobile app data? That could have timestamped hits arriving late to the report suite, which updates the numbers in the AA UI after the data feed export is already done, so the export won't include those hits.

If you are still seeing the discrepancy for data older than 2-3 hours, then give this a try:

hit_date_pv <- nrow(hit_data_df[
  hit_data_df$exclude_hit == 0 &&
    !(hit_data_df$hit_source %in% c(5, 7, 8, 9)) &&
    (!is.na(hit_data_df$post_pagename) || !is.na(hit_data_df$post_page_url)),
])

Cheers!


Level 3

Thanks @Harveer_SinghGi1 !

 

As confirmed to Jenn, this data is almost 2 weeks old.

 

I applied what you just shared and got:

Error in hit_data_df$exclude_hit == 0 && !(hit_data_df$hit_source %in% c(5, 7, 8, :
'length = 253048' in coercion to 'logical(1)'

Wondering why.

 

Let's see if I understood what you shared, explained in English:

  • Include any row where exclude_hit is 0
  • Remove any row where hit_source is 5, 7, 8 or 9
  • Remove rows where both post_pagename and post_page_url are NA

 

If I understood correctly, then, as mentioned before, removing only the NAs showed just 20% of the page views I see in AA Workspace for the same hour.

 

Should I leave the NAs, remove the hit_source values 5, 7, 8 or 9, and see what happens? Or must the NAs be removed per the definition 'Count the number of rows where a value is in post_pagename or post_page_url'?

 

Thanks again!


R

 

 


Community Advisor and Adobe Champion

In our process, we don't try to filter based on a list of specific exclude_hit values... we just include anything that is exclude_hit = 0 (i.e. no exclusion)

 

While many of the exclude_hit values are no longer in use, it's a more robust system to simply include "0" and exclude anything else... if an old value is re-purposed, or a new value is added, you don't have to change any logic.
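
In code, that robust approach is a single equality filter; a minimal sketch against the hit_data_df loaded earlier (tidyverse assumed):

# keep only non-excluded hits; any other exclude_hit value is dropped,
# so re-purposed or newly added exclusion codes need no logic changes
hit_data_kept <- hit_data_df %>%
  filter(exclude_hit == 0)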


Community Advisor

Hi @Rafael_Sahagun ,

The error comes from the && and || operators: from R 4.3.0 onward they raise an error when their operands are vectors longer than 1 (the same message discussed in https://github.com/rstudio/DT/issues/1095), so row-wise conditions need the vectorized & and | instead; see the sketch after the list below.

The idea is to apply this logic as per the page view calculation documentation:

  • Include only rows with exclude_hit = 0
  • Remove rows with hit_source values 5, 7, 8 and 9 (this removes data source ingestions and can be skipped)
  • To get page views, count rows where either post_pagename or post_page_url is present

This should give you the correct page view numbers.
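
A minimal sketch of that logic with the vectorized & and | (assuming the hit_data_df and tidyverse setup from earlier in the thread):

hit_date_pv <- hit_data_df %>%
  filter(
    exclude_hit == 0,                              # include only non-excluded hits
    !hit_source %in% c(5, 7, 8, 9),                # drop data source ingestions
    !is.na(post_pagename) | !is.na(post_page_url)  # a page value must be present
  ) %>%
  nrow()

Note that filter() also drops rows where a condition evaluates to NA, which base-R bracket indexing would instead return as all-NA rows.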

Looking at the options you have tried, they all are lacking one or the other thing. Let's take a very small set of rows and try these queries,

[Screenshot: a sample table of four hits with post_pagename and exclude_hit values, and the rows each query returns]

Based on the data in the table shown, there is only 1 valid page view among these 4 hits. As you can see, selecting only page_name also returns the rows with NA. Using na.omit works but still returns rows with exclude_hit != 0. Just filtering on exclude_hit gives both NA and non-NA values of page_name. What you need is the combination of these conditions shown in the last query.
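
Since the screenshot may not carry over, here is a self-contained sketch with four made-up rows that reproduces the same comparison:

library(tidyverse)

# four hypothetical hits: only the first is a valid page view
sample_hits <- tibble(
  post_pagename = c("home", NA, "cart", NA),
  post_page_url = c(NA, NA, NA, NA),
  exclude_hit   = c(0, 0, 1, 1)
)

nrow(select(sample_hits, post_pagename))           # 4: the NAs are still counted
nrow(na.omit(select(sample_hits, post_pagename)))  # 2: keeps an excluded hit
nrow(filter(sample_hits, exclude_hit == 0))        # 2: keeps an NA page name
nrow(filter(sample_hits, exclude_hit == 0,
            !is.na(post_pagename) | !is.na(post_page_url)))  # 1: correct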

If this is still giving you incorrect numbers, then I suggest you check the data in the RSID for these things:

  • Is the RSID receiving any new page view hits (generated via late-arriving hits or the Data Insertion API) after the data feed exports for that particular hour were done?
  • Compare the number of unique page name values and check if there is a discrepancy there
  • Narrow down to a particular page name and compare data

Cheers!


Level 3

Thanks both @Harveer_SinghGi1 & @Jennifer_Dungan !

 

If I change escape_backslash = TRUE to FALSE, then the backslash doesn't get removed, and it results in a 96% match between AA UI page views and the row count of post_pagename (the smaller value being on the Data Feeds side).

Looking at the distinct post_pagename values, they make sense and apparently don't contain a single backslash, even when I didn't remove those.

accept_language (image below) is what prompted me to think I had to remove the backslashes from the whole tibble/data frame, as it looked odd in that column... could it be that only some dimensions need to have the backslash removed? Or not at all?

 

[Screenshot: the accept_language column showing values containing backslashes]
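
To see what escape_backslash actually changes, here is a tiny self-contained sketch (the data line is made up): in the raw text the middle field contains a backslash-escaped tab, similar to how the feed escapes special characters.

library(readr)

# header plus one data line; the second field contains "a\<TAB>b"
txt <- "col1\tcol2\tcol3\nx\ta\\\tb\ty\n"

# TRUE: the backslash escapes the tab, so the row parses into 3 fields
read_delim(I(txt), delim = "\t", quote = "",
           escape_backslash = TRUE, escape_double = FALSE)

# FALSE: the escaped tab splits the value, shifting data across columns
read_delim(I(txt), delim = "\t", quote = "",
           escape_backslash = FALSE, escape_double = FALSE)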


Level 3

Hi @Harveer_SinghGi1 ,

So is that '\' character unusual to see in AA data feed exports, like the one in the screenshot I shared?

Regardless of whether I set escape_backslash = TRUE or FALSE, I see values that seem to belong to other columns (not sure yet of the percentage). Wondering if that is a red flag, and whether it is a problem with the way I'm getting the data into RStudio, or R itself failing...

Thanks a lot!

 

 


Level 3

Could it be that the data frame doesn't show the correct order once in R because 650 MB is too much data for RStudio to handle? Thanks again for the help!


Community Advisor

Hi @Rafael_Sahagun ,

When reading delimited data, whenever you see issues like unexpected values flowing into columns where preset values are expected, it is almost always because the delimiter is present inside one of the column values. In your case, it seems the tab delimiter (\t) is recorded in one of the values returned in the data feed.

I suggest you narrow down to the row where you start seeing the unexpected values; the row above it should be the one containing the tab delimiter in one of its values. After identifying the rows, you can figure out how to drop them and keep the remaining ones.
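
One way to find such rows (a sketch, assuming the raw file is the hit_df.tsv loaded earlier): compare the raw field count of every line against the expected number of columns, and check what readr flagged while reading.

# count tab-separated fields on each raw line of the feed (no quoting)
n_fields <- count.fields("hit_df.tsv", sep = "\t",
                         quote = "", comment.char = "")

# lines whose field count differs from the header are the suspects
which(n_fields != length(col_names))

# readr also records the parsing issues it noticed
problems(hit_data_df)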

Cheers!

Thanks @Harveer_SinghGi1 

I'll look into that!

Wondering if the issue could have been avoided by checking the box when setting the data feeds up?

 

[Screenshot: the data feed setup screen showing the checkbox in question]