library(tidyverse)

# read the tab-delimited hit data (data feed files have no header row)
hit_data_df <- read_delim(
  "hit_df.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE,
  na = c("", "NA")
)

# read the column headers delivered as a separate tsv
headers <- read_delim(
  "column_headers.tsv",
  delim = "\t",
  quote = "",
  col_names = FALSE,
  escape_backslash = TRUE
)

# apply the headers to hit_data_df
col_names <- as.character(headers[1, ])
colnames(hit_data_df) <- col_names
# page views: post_pagename with and without NAs
page_name_pv <- hit_data_df %>% select(post_pagename)
page_name_pv_na_om <- hit_data_df %>%
  select(post_pagename) %>%
  na.omit()

# split hits by the exclude_hit flag (0 = not excluded)
hit_data_df_ih0 <- hit_data_df %>%
  filter(exclude_hit == "0")
hit_data_df_xh0 <- hit_data_df %>%
  filter(exclude_hit != "0")
How much delay do you have on your raw data feeds? If you are running them hourly and you don't have any delay, there is a good chance that you are missing data that is still processing on the Adobe server. For safety, we use the maximum delay of 120 minutes (2 hours) on our hourly raw data exports to ensure that the data is fully available.
Before I got involved in that initiative, our Data Lake team didn't have any delay and they were constantly short on data, not realizing that there is data processing latency in Adobe that needs to complete.
Hi @Jennifer_Dungan, the 1-hour feed file is from mid-February, so more than a week has gone by. Thanks!
Hmm so this is a file pulled from older data... that is interesting...
While I don't process our data feeds myself, I work with the people that do... and our data is very very close... with the exclude hit filtering and all...
I am not sure why you are getting such a huge discrepancy....
Just to confirm, the file from mid-Feb, that was pulled recently? You aren't talking about a test file that was pulled in mid-Feb and you are only processing it now?
Thanks @Jennifer_Dungan
The TSV is from Feb 13th and occurrences match at a little above 99%, so that's partially good news, suggesting the data is settled by now. The issue comes when getting into other metrics.
But great point, I'll double confirm.
Hi @Rafael_Sahagun ,
You should check for data processing delays as @Jennifer_Dungan suggested. Also, is it mobile app data? That could have timestamped hits arriving late to the report suite, which will update the numbers in the AA UI, but the data feeds export is already done, so it won't have such hits in it.
If you are still seeing the discrepancy for data older than 2-3 hours, then give this a try:
hit_date_pv <- nrow(hit_data_df[hit_data_df$exclude_hit == 0 && !(hit_data_df$hit_source %in% c(5,7,8,9)) && (!is.na(hit_data_df$post_pagename) || !is.na(hit_data_df$post_page_url)),])
Cheers!
Thanks @Harveer_SinghGi1 !
As confirmed to Jenn, this data is almost 2 weeks old.
I applied what you just shared and got:
Error in hit_data_df$exclude_hit == 0 && !(hit_data_df$hit_source %in% c(5, 7, 8, :
'length = 253048' in coercion to 'logical(1)'
Wondering why.
Let's see if I understood what you shared, explained in English:
Include any exclude_hit that is a 0
Remove any exclude_hit that is a 5, 7, 8 or 9
Remove NAs from post_pagename or post_page_url
If I understood correctly, then only removing the NAs as mentioned before showed only 20% of the page views I see in AA Workspace for the same hour.
Should I leave the NAs, remove the exclude hits 5, 7, 8 or 9 and see what happens? Or must the NAs be removed per the definition 'Count the number of rows where a value is in post_pagename or post_page_url'?
Thanks again!
R
In our process, we don't try to filter based on a list of specific exclude_hit values... we just include anything that is exclude_hit = 0 (i.e. no exclusion)
While many of the exclude_hit values are no longer in use, it's a more robust system to simply include "0" and exclude anything else... if an old value is re-purposed, or a new value is added, you don't have to change any logic.
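In code, something like this (a minimal sketch in the same tidyverse style as the earlier snippets, reusing the hit_data_df tibble from the original post):
# keep only hits with no exclusion reason; any non-zero exclude_hit is dropped,
# so a new or re-purposed exclusion code never requires a logic change
valid_hits <- hit_data_df %>%
  filter(exclude_hit == "0")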
Hi @Rafael_Sahagun ,
The error you got seems to be related to the DT/DTedit packages on R 4.3.0 onwards - https://github.com/rstudio/DT/issues/1095
I'll check how to fix this, but the idea is to apply this logic as per the page view calculation documentation:
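In dplyr terms, that logic looks roughly like this (a sketch, not the exact code from the documentation; it relies on filter()'s vectorised semantics rather than && / ||, and whether exclude_hit and hit_source were parsed as numbers or strings depends on your read, so adjust 0 vs "0" accordingly):
hit_data_pv <- hit_data_df %>%
  filter(
    exclude_hit == 0,                               # hit not excluded
    !(hit_source %in% c(5, 7, 8, 9)),               # drop the hit_source values named in the docs
    !is.na(post_pagename) | !is.na(post_page_url)   # a page name or page URL is present
  )

# page view count for the feed
nrow(hit_data_pv)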
This should give you the correct page view numbers.
Looking at the options you have tried, they are all lacking one thing or another. Let's take a very small set of rows and try these queries:
Based on the data in the table shown, there is only 1 valid page view among these 4 hits. As you can see, only selecting page_name also returns rows with NA. Using na.omit works but still returns rows with exclude_hit != 0. Just filtering for exclude_hit will give all NA and non-NA values of page_name. What you need is the combination of these conditions shown in the last query.
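To make that concrete with made-up values (a hypothetical 4-hit sample, not real feed data, where only the last row is a valid page view):
sample_hits <- tibble(
  post_pagename = c(NA, "home", NA, "checkout"),
  exclude_hit   = c("0", "1", "1", "0")
)

sample_hits %>% select(post_pagename)                      # 4 rows, NAs included
sample_hits %>% select(post_pagename) %>% na.omit()        # 2 rows, but "home" belongs to an excluded hit
sample_hits %>% filter(exclude_hit == "0")                 # 2 rows, but one page name is NA
sample_hits %>% filter(exclude_hit == "0",
                       !is.na(post_pagename))              # 1 row: the only valid page view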
If this is giving you incorrect numbers, then I'd suggest you check the data in the RSID for these things:
Cheers!
Thanks both @Harveer_SinghGi1 & @Jennifer_Dungan !
If I change escape_backslash = TRUE to FALSE, then the backslash doesn't get removed, and the result is a 96% match between AA UI page views and the row count of post_pagename (the smaller value being in Data Feeds).
Looking at the distinct post_pagename values, they make sense and apparently don't contain a single backslash, even when I didn't remove those.
accept_language (image below) is what prompted me to think I had to remove backslashes from the whole tibble/data frame, as it looked odd in that column... Would it be that only some dimensions need to have the backslash removed? Or not at all?
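One way to see exactly what the flag changes (a sketch: read the same feed twice, toggle only escape_backslash, apply the headers, and diff the column that looks odd; read_feed is just a helper defined here, and col_names is the header vector from the first post):
read_feed <- function(path, esc) {
  read_delim(path, delim = "\t", quote = "", col_names = FALSE,
             escape_backslash = esc, na = c("", "NA"))
}

feed_esc   <- read_feed("hit_df.tsv", TRUE)
feed_noesc <- read_feed("hit_df.tsv", FALSE)
colnames(feed_esc)   <- col_names
colnames(feed_noesc) <- col_names

# differing row counts mean the escaping setting changes how lines split into rows
nrow(feed_esc)
nrow(feed_noesc)

# values that only appear under one setting, e.g. in accept_language
setdiff(unique(feed_noesc$accept_language), unique(feed_esc$accept_language))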
Hi @Harveer_SinghGi1 ,
So is that '\' character unusual to see in AA data feed exports, like the one in the screenshot I shared?
Regardless of whether I set escape_backslash = TRUE or FALSE, I see values that seem to belong to other columns (not sure yet of the %). Wondering if that is a red flag, and also whether it is a problem with the way I'm getting the data into RStudio or R itself failing...
Thanks a lot!
Could it be that the data frame doesn't show the correct order once in R because 650 MB is too much data for RStudio to handle? Thanks again for the help!
Hi @Rafael_Sahagun ,
When reading delimited data, whenever you see unexpected values flowing into columns where preset values are expected, it is almost always due to the delimiter being present inside one of the column values. In your case it seems the tab delimiter (\t) is recorded in one of the values returned in the data feed.
I'd suggest you narrow down to the row where you start seeing the unexpected values; the row above should be the one containing a tab delimiter inside one of its values. After identifying the rows, you can figure out how to drop them and keep the remaining ones.
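One way to narrow that down is to count how many tab-separated fields each physical line of the file has; lines carrying a stray tab won't match the expected column count (a sketch using readr's count_fields() with the same tokenizer settings as the main read, and col_names from the headers file):
expected <- length(col_names)   # expected number of columns

fields_per_line <- count_fields(
  "hit_df.tsv",
  tokenizer_delim(delim = "\t", quote = "", escape_backslash = TRUE,
                  escape_double = FALSE)
)

# line numbers whose field count is off are the suspect rows
which(fields_per_line != expected)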
Cheers!
Thanks @Harveer_SinghGi1
I'll look into that!
Wondering if the issue would have been avoided by checking that box when setting up the data feed?