Data Feeds Count Page Views, Visits, and other Metrics Using R Studio | Community
Level 3
February 26, 2025
Solved
Hello!
 
To keep it simple for now I want to compare counts of page views first.
Here is how I loaded the data. The "\" escape character seemed to be messing up the tibble, so I removed it using escape_backslash = TRUE:

library(tidyverse)

# read the TSV hit data
hit_data_df <- read_delim(
  "hit_df.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE,
  na = c("", "NA")
)

# read the TSV column headers
headers <- read_delim(
  "column_headers.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE
)

# apply the headers to hit_data_df
col_names <- as.character(headers[1, ])
colnames(hit_data_df) <- col_names
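As a quick sanity check after loading, readr can report rows whose field count doesn't match the rest, which is exactly the symptom of a stray tab inside a value. A minimal sketch, using a hypothetical temporary TSV in place of hit_df.tsv:

```r
library(readr)

# toy TSV standing in for hit_df.tsv (hypothetical path and contents)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6"), tsv_path)

hit_check_df <- read_delim(tsv_path, delim = "\t", quote = "", col_names = FALSE)

# readr records rows whose field count doesn't match the rest;
# a non-empty result usually points at a stray tab inside a value
problems(hit_check_df)

# base-R cross-check: every line should have the same field count
table(count.fields(tsv_path, sep = "\t", quote = ""))
```

If the `table()` call shows more than one field count, some line in the file has extra or missing tabs.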
Based on the definition for page view count in Data Feeds calculate metrics (https://experienceleague.adobe.com/en/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-calculate): 'Count the number of rows where a value is in post_pagename or post_page_url'
 
The line below matched the number of occurrences 99% in AA Workspace for the same date range (1 hour):

 

page_name_pv <- hit_data_df %>% select(post_pagename)
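For what it's worth, `select()` keeps every row, NA or not, so `nrow()` on that result counts all hits. A minimal sketch with a hypothetical toy tibble:

```r
library(dplyr)

# toy frame standing in for hit_data_df (hypothetical values)
pv_toy <- tibble::tibble(post_pagename = c("home", NA, "cart"))

# select() keeps every row, so this counts all hits, NA included
page_name_pv <- pv_toy %>% select(post_pagename)
nrow(page_name_pv)                    # 3

# counting only rows that actually carry a page name
sum(!is.na(pv_toy$post_pagename))     # 2
```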

 

 

Looking back at the definition 'where a value is in post_pagename', I decided to remove NAs, thinking it would match the page view counts:

 

page_name_pv_na_om <- hit_data_df %>% select(post_pagename) %>% na.omit()
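One thing worth noting: `na.omit()` on the single selected column drops rows where `post_pagename` is NA, but per the documented definition a row also counts when only `post_page_url` has a value. A sketch with a hypothetical toy frame:

```r
library(dplyr)

# toy rows: one named page, one URL-only hit, one with neither (hypothetical)
or_toy <- tibble::tibble(
  post_pagename = c("home", NA, NA),
  post_page_url = c(NA, "https://example.com/cart", NA)
)

# na.omit() on one column misses the URL-only page view
only_names <- or_toy %>% select(post_pagename) %>% na.omit()
nrow(only_names)  # 1

# the documented definition is an OR across the two columns
either_present <- or_toy %>%
  filter(!is.na(post_pagename) | !is.na(post_page_url))
nrow(either_present)  # 2
```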

 

 
...but that showed only 20% of the page views I see in AA Workspace for the same hour.
 
Furthermore, it seems I still need to filter to exclude_hit = 0, which will lead to even lower counts?
 
This looks a bit counterintuitive, as it kept the zeroes; what I did there is include "0", right?

 

hit_data_df_ih0 <- hit_df_look %>% filter(exclude_hit == "0")

 

 
The line below would show 'Y' in all the rows I saw; I'm not sure if there were other values.
 

 

hit_data_df_xh0 <- hit_df_look %>% filter(exclude_hit != "0")
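Rather than guessing which values exclude_hit can take, it may help to tabulate them first. A sketch, with the values in the toy tibble being purely hypothetical:

```r
library(dplyr)

# toy column standing in for hit_data_df$exclude_hit (hypothetical values)
eh_toy <- tibble::tibble(exclude_hit = c("0", "0", "1", "Y"))

# one row per distinct value, with its frequency
eh_toy %>% count(exclude_hit)

# keep only countable hits, i.e. exclude_hit == "0"
kept <- eh_toy %>% filter(exclude_hit == "0")
nrow(kept)  # 2
```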

 

 
It would be great to know what I am doing wrong, or if the data is not good (the reason I was asked to compare it).
 
Thanks!
 
R
Best answer by Harveer_SinghGi1

Could it be that the data frame doesn't show the correct order once in R because 650MB is too much data for R Studio to handle? Thanks again for the help!


Hi @rafael_sahagun ,

While reading delimited data, whenever you see issues like unexpected values flowing into columns where preset values are expected, it is almost always due to the delimiter being present in one of the column values. In your case it seems the tab delimiter (\t) is recorded in one of the values returned in the data feed.

I'd suggest you narrow down to the row where you start seeing the unexpected values; the row above should be the one containing a tab delimiter in one of its values. After identifying such rows, you can figure out how to drop them and keep the remaining ones.

Cheers!


Jennifer_Dungan
Community Advisor and Adobe Champion
February 26, 2025

How much delay do you have on your raw data feeds? If you are running them hourly and you don't have any delay, there is a good chance that you are missing data that is still processing on the Adobe server. For safety, we use the maximum delay of 120 minutes (2 hours) on our hourly raw data exports to ensure that the data is fully available.

 

Before I got involved in that initiative, our Data Lake team didn't have any delay and they were constantly short on data, not realizing that there is data processing latency in Adobe that needs to complete.

Level 3
February 26, 2025

Hi @jennifer_dungan, the 1-hour feed file is from mid-February, so more than a week has gone by. Thanks!

Jennifer_Dungan
Community Advisor and Adobe Champion
February 27, 2025

Hmm so this is a file pulled from older data... that is interesting... 

 

While I don't process our data feeds myself, I work with the people that do... and our data is very very close... with the exclude hit filtering and all...

 

I am not sure why you are getting such a huge discrepancy....

 

 

Just to confirm, the file from mid-Feb, that was pulled recently? You aren't talking about a test file that was pulled in mid-Feb and you are only processing it now?

Harveer_SinghGi1
Community Advisor and Adobe Champion
February 26, 2025

Hi @rafael_sahagun ,

You should check for data processing delays as @jennifer_dungan suggested. Also, is it mobile app data? That could have timestamped hits arriving late to the report suite, which will update the numbers in the AA UI; but the data feed export is already done, so it won't have such hits in it.

If you are still seeing the discrepancy for data older than 2-3 hours, then give this a try:

hit_date_pv <- nrow(hit_data_df[hit_data_df$exclude_hit == 0 && !(hit_data_df$hit_source %in% c(5,7,8,9)) && (!is.na(hit_data_df$post_pagename) || !is.na(hit_data_df$post_page_url)),])

Cheers!

Level 3
February 26, 2025

Thanks @harveer_singhgi1 !

 

As confirmed to Jenn, this data is almost 2 weeks old.

 

I applied what you just shared and got:

Error in hit_data_df$exclude_hit == 0 && !(hit_data_df$hit_source %in% c(5, 7, 8, :
'length = 253048' in coercion to 'logical(1)'

Wondering why.

 

Let's see if I understood what you shared, explained in English:

Include any exclude_hit that is a 0

Remove any hit_source that is a 5, 7, 8 or 9

Remove NAs from post_pagename or post_page_url

 

If I understood well, then only removing the NAs, as mentioned before, showed only 20% of the page views I see in AA Workspace for the same hour.

 

Should I leave the NAs, remove the exclude hits 5, 7, 8 or 9 and see what happens? Or must the NAs be removed per the definition 'Count the number of rows where a value is in post_pagename or post_page_url'?

 

Thanks again!


R

 

 

Level 3
February 27, 2025

Hi @rafael_sahagun ,

The error you got comes from R 4.3.0 onward, where && and || stop with an error when given conditions of length > 1 (the same change discussed in https://github.com/rstudio/DT/issues/1095); the vectorized & and | should be used instead.

The idea is to apply this logic as per the page view calculation documentation:

  • Include rows only with exclude_hit = 0
  • Remove rows with hit_source values 5, 7, 8 and 9 (this is to remove data source ingestions, can be skipped)
  • To get page views, count rows where either post_pagename or post_page_url is present

This should give you the correct page view numbers.
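The three steps above can be sketched with the vectorized &/| operators (the toy tibble and its values are hypothetical; the column names follow the thread):

```r
library(dplyr)

# toy hits standing in for hit_data_df (hypothetical values)
feed_toy <- tibble::tibble(
  exclude_hit   = c(0, 0, 1, 0),
  hit_source    = c(1, 5, 1, 1),
  post_pagename = c("home", "cart", "home", NA),
  post_page_url = c(NA, NA, NA, NA)
)

page_views <- feed_toy %>%
  filter(
    exclude_hit == 0,                               # countable hits only
    !(hit_source %in% c(5, 7, 8, 9)),               # drop data-source ingestions
    !is.na(post_pagename) | !is.na(post_page_url)   # a page view needs either value
  )

nrow(page_views)  # 1
```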

Looking at the options you have tried, they are each missing one thing or another. Let's take a very small set of rows and try these queries.

Based on the data in the table shown, there is only 1 valid page view among these 4 hits. As you can see, selecting only page_name also returns rows with NA. Using na.omit works but still returns rows with exclude_hit != 0. Just filtering for exclude_hit will give all NA and non-NA values of page_name. What you need is a combination of these conditions, shown in the last query.
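The table from the original screenshot isn't reproduced here, but its described behavior can be approximated with a hypothetical four-hit tibble:

```r
library(dplyr)

# hypothetical 4 hits: only the first is a valid, countable page view
hits <- tibble::tibble(
  post_pagename = c("home", NA, "cart", NA),
  exclude_hit   = c(0, 0, 1, 1)
)

hits %>% select(post_pagename) %>% nrow()                 # 4: NAs still included
hits %>% select(post_pagename) %>% na.omit() %>% nrow()   # 2: excluded hit still counted
hits %>% filter(exclude_hit == 0) %>% nrow()              # 2: NA page name still counted
hits %>% filter(exclude_hit == 0, !is.na(post_pagename)) %>% nrow()  # 1: the valid page view
```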

If this is giving you incorrect numbers then I'll suggest you check for the data in the RSID for these things,

  • Is the RSID receiving any new page view hits (generated via late-arriving hits or the Data Insertion API) after the data feed exports for that particular hour were done?
  • Compare the number of unique page name values and check if there is a discrepancy there
  • Narrow down to a particular page name and compare data

Cheers!


Thanks both @harveer_singhgi1 & @jennifer_dungan !

 

If I change escape_backslash = TRUE to FALSE, then it doesn't get removed, and I get a 96% match between AA UI page views and the row count of post_pagename (the smaller value is in Data Feeds).

Looking at distinct post_pagename values, they make sense and apparently do not contain a single backslash, even though I didn't remove those.

accept_language (image below) is what prompted me to think I had to remove the backslashes from the whole tibble/data frame, as it looked odd in that column... could it be that only some dimensions need the backslash removed? Or not at all?
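If only specific columns such as accept_language carry literal backslashes, one option, sketched here with hypothetical values, is to read with escape_backslash = FALSE and strip the character from just those columns afterwards:

```r
library(dplyr)

# toy frame; accept_language carries a literal backslash (hypothetical values)
bs_toy <- tibble::tibble(
  post_pagename   = c("home", "cart"),
  accept_language = c("en-US\\en", "es-MX")
)

# remove literal backslashes only from the column that needs it
bs_toy <- bs_toy %>%
  mutate(accept_language = gsub("\\", "", accept_language, fixed = TRUE))

bs_toy$accept_language  # "en-USen" "es-MX"
```

This avoids altering columns like post_pagename that never contained backslashes in the first place.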

 

Sukrity_Wadhwa
Community Manager
April 2, 2025

Hi @rafael_sahagun,

Were you able to resolve this query with the help of the provided solutions, or do you still need further assistance? Please let us know. If any of the answers were helpful in moving you closer to a resolution, even partially, we encourage you to mark the one that helped the most as the 'Correct Reply.'
Thank you!

Sukrity Wadhwa
Level 3
April 4, 2025

Thanks for reminding me @sukrity_wadhwa  and for the help @harveer_singhgi1 & @jennifer_dungan !