Hey,
I'm trying to send multiple requests against a single report suite to page through a report using the params "TOP" and "STARTINGWITH". Can I send these paging requests in parallel? Will the API prepare the reports sequentially, or also in parallel?
My main concern is cardinality. The estimated row count can be around 1M, so I plan to send multiple requests in parallel instead of sequentially. Would this reduce my total download time?
In our implementation, the server appears to process up to 8 queued reports at a time, and the queue appears to be served in order of request receipt.
Say you have a baseline list of args and you know the 'top' value ahead of time. Note that the reporting API has a 50K limit for 'top'.
List of 9
$ reportsuite.id : chr "myID"
$ date.from : chr "2016-09-01"
$ date.to : chr "2016-09-02"
$ elements : chr "my_element"
$ metrics : chr [1:2] "pageviews" "visits"
$ segment.id : chr [1:2] "segment1" "segment2"
$ date.granularity: chr "month"
$ top : int 100000
$ enqueueOnly : logi TRUE
You could then chunk this up into, say, 4 pieces:
List of 4
$ chunk1:List of 10
..$ reportsuite.id : chr "myID"
..$ date.from : chr "2016-09-01"
..$ date.to : chr "2016-09-02"
..$ elements : chr "my_element"
..$ metrics : chr [1:2] "pageviews" "visits"
..$ segment.id : chr [1:2] "segment1" "segment2"
..$ date.granularity: chr "month"
..$ top : num 25000
..$ enqueueOnly : logi TRUE
..$ start : int 1
$ chunk2:List of 10
..$ reportsuite.id : chr "myID"
..$ date.from : chr "2016-09-01"
..$ date.to : chr "2016-09-02"
..$ elements : chr "my_element"
..$ metrics : chr [1:2] "pageviews" "visits"
..$ segment.id : chr [1:2] "segment1" "segment2"
..$ date.granularity: chr "month"
..$ top : num 25000
..$ enqueueOnly : logi TRUE
..$ start : int 25001
$ chunk3:List of 10
..$ reportsuite.id : chr "myID"
..$ date.from : chr "2016-09-01"
..$ date.to : chr "2016-09-02"
..$ elements : chr "my_element"
..$ metrics : chr [1:2] "pageviews" "visits"
..$ segment.id : chr [1:2] "segment1" "segment2"
..$ date.granularity: chr "month"
..$ top : num 25000
..$ enqueueOnly : logi TRUE
..$ start : int 50001
$ chunk4:List of 10
..$ reportsuite.id : chr "myID"
..$ date.from : chr "2016-09-01"
..$ date.to : chr "2016-09-02"
..$ elements : chr "my_element"
..$ metrics : chr [1:2] "pageviews" "visits"
..$ segment.id : chr [1:2] "segment1" "segment2"
..$ date.granularity: chr "month"
..$ top : num 25000
..$ enqueueOnly : logi TRUE
..$ start : int 75001
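For anyone calling the API directly rather than through the R library, the chunking shown above could be sketched roughly as follows. This is only an illustration in Python: `chunk_args` is a hypothetical helper, the argument names simply mirror the listing above, and the `max_top` default reflects the 50K limit mentioned earlier rather than anything checked against the API.

```python
from copy import deepcopy


def chunk_args(base_args, total_rows, n_chunks, max_top=50000):
    """Split one request into chunks that differ only in 'start' and 'top'.

    Hypothetical helper: argument names mirror the listing above.
    """
    if total_rows < 1 or n_chunks < 1:
        raise ValueError("total_rows and n_chunks must be positive")
    size = -(-total_rows // n_chunks)  # ceiling division
    if size > max_top:
        raise ValueError("chunk size exceeds the 'top' limit; use more chunks")

    chunks = {}
    start = 1  # 'start' is 1-based in the listing above
    index = 1
    while start <= total_rows:
        args = deepcopy(base_args)
        # The final chunk may be smaller when the rows do not divide evenly.
        args["top"] = min(size, total_rows - start + 1)
        args["start"] = start
        args["enqueueOnly"] = True
        chunks[f"chunk{index}"] = args
        start += size
        index += 1
    return chunks


base = {
    "reportsuite.id": "myID",
    "date.from": "2016-09-01",
    "date.to": "2016-09-02",
    "elements": "my_element",
    "metrics": ["pageviews", "visits"],
    "top": 100000,
}
chunks = chunk_args(base, total_rows=100000, n_chunks=4)
for name, args in chunks.items():
    print(name, args["start"], args["top"])
```

With the numbers from the listing (100,000 rows in 4 chunks), this yields start values 1, 25001, 50001 and 75001, each with top = 25000.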
With enqueueOnly == TRUE, the call returns report IDs, which you can then pull down with Report.Get().
I'm using the Analytics API directly. Is there a param to set enqueueOnly == TRUE, or is that the default? I'm using Report.Queue() to get the reportId and then calling Report.Get().
So does this mean, roughly, that if a single request originally returned the entire report after 4 minutes, and I now split it into 4 requests and send them to the API queue (assuming the queue contains only these 4 requests), I can get all 4 responses in about 1 minute, since each request covers only 1/4 of the data and they are processed in parallel?
One follow-up question: if I split the request into four and send the chunks sequentially, is there a chance that some chunks become inconsistent when new data comes in? For example, at time 0 I request the first chunk, at time 1 I request the second chunk, and at time 3 the total row count increases from 100,000 to 100,001. Will that newly added row always land in the last chunk?
Yes, that's the default. The library I use to access the API strings together Report.Queue() and Report.Get() into a single function, and provides the option to use the "default" workflow through its enqueueOnly argument. Sorry for not making this clear.
Pretty much. Informally, I've seen a 2-6x improvement in throughput, depending on the nature of the request.
You can give yourself a bit of headroom in your initial top value. I often make a call to determine the number of rows, using the Row Count calculated metric, for each report that I am going to chunk up. I'm not sure if there is a more elegant way, but there's a bit more context here:
DISTINCT COUNT of element values? · Issue #186 · randyzwitch/RSiteCatalyst · GitHub
In any event, you could add e.g. 10% to the expected number of rows to be safe. This way, if your last chunk has fewer rows than specified by its top arg, you know that you've captured all the data.
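As a rough sketch of that padding rule (both function names are hypothetical, and the 10% figure is just the example above):

```python
def padded_top(expected_rows, padding_pct=10):
    """Expected row count plus a safety margin, rounded up (integer math)."""
    margin = -(-expected_rows * padding_pct // 100)  # ceiling division
    return expected_rows + margin


def captured_everything(last_chunk_row_count, last_chunk_top):
    """True if the last chunk came back short, so no rows lie beyond it."""
    return last_chunk_row_count < last_chunk_top


total = padded_top(100000)  # request 110,000 rows in total
print(total)
print(captured_everything(17500, 27500))  # last chunk short: all data captured
print(captured_everything(27500, 27500))  # last chunk full: there may be more
```

If the last chunk comes back completely full, the safe reading is that more rows may exist, so you would queue one more chunk starting after it.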
Great, thanks a lot!
So to get the row count, I need to save a calculated metric that computes the row count, and then use that number to make my report requests, right?
Once I shard the request into several chunks, can I always assume that there will be NO overlapping data between the chunks?
Yep, you just make a simple calculated metric (CM) using row count and use that to make your request to get the top value. Note that you will need to include at least one non-calc metric, e.g. visits, to actually pull any data. Also note that to get the expected result, you can only have a single breakdown, since you want the row count for that single breakdown dimension. This means that you want a report that does not have a date breakdown built in. I just use a Ranked report.
Re: overlapping data: I almost always end the time frame at the current day minus one, and have only had reason to use Ranked reports. Given that I'm looking at historic (stable) data, and each request is identical except for the start and top params, I can't think of any scenario where overlap should be possible on my end. You can (and should) test your approach by comparing a chunked report set against a single call. I definitely did this when starting out.
It's also helpful to write a helper function to compartmentalize the chunking logic, along with general query construction helpers, tryCatch recovery, logging, etc., depending on how complicated your queues get. I think I read that DW queues have a 72-hour time frame for retrieval, although I don't know if that is the same for the reporting API.
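The chunked-versus-single comparison mentioned above could be sketched like this. It assumes each row is a sequence whose first item is the element value (the breakdown key); that row layout and the function name are assumptions for illustration, not the API's response format.

```python
from collections import Counter


def compare_chunked_to_single(chunk_rows, single_rows, key=lambda row: row[0]):
    """Report overlap and gaps between a chunked pull and one full pull."""
    combined = [row for chunk in chunk_rows for row in chunk]
    counts = Counter(key(row) for row in combined)
    single_keys = {key(row) for row in single_rows}
    return {
        # Keys that appear in more than one place across the chunks.
        "duplicates": sorted(k for k, n in counts.items() if n > 1),
        # Keys in the single pull that no chunk returned.
        "missing": sorted(single_keys - set(counts)),
        # Keys some chunk returned that the single pull did not.
        "extra": sorted(set(counts) - single_keys),
        # Strictest check: same rows in the same order.
        "identical": combined == list(single_rows),
    }


single = [("a", 3), ("b", 2), ("c", 1)]
clean = compare_chunked_to_single([[("a", 3), ("b", 2)], [("c", 1)]], single)
overlap = compare_chunked_to_single(
    [[("a", 3), ("b", 2)], [("b", 2), ("c", 1)]], single
)
print(clean)
print(overlap)
```

Running this once on a report small enough to fetch in a single call gives some confidence before relying on chunking for the large ones.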
Hey sorry,
Can I ask one more question here? Thanks!
I'm trying to use the "start" and "top" params for paging, but when I select multiple metrics (> 1) in the report, every page except the last looks unsorted, while the last page looks sorted.
For example (sorted by the first metric):
1. First page (top = 3, startingWith = 0). Response:
count: [3, 0, 1]; [3, 0, 2]; [3, 0, 4]
2. Second and last page (top = 3, startingWith = 3). Response:
count: [2, 0, 3]; [2, 0, 2]
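One way to check the ordering across pages is a small script like the one below. It assumes the report is meant to be sorted in descending order by a single metric; the helper name is hypothetical and the rows are the ones quoted above.

```python
def is_sorted_desc(rows, metric_index=0):
    """True if the chosen metric never increases from one row to the next."""
    values = [row[metric_index] for row in rows]
    return all(a >= b for a, b in zip(values, values[1:]))


page1 = [[3, 0, 1], [3, 0, 2], [3, 0, 4]]
page2 = [[2, 0, 3], [2, 0, 2]]
all_rows = page1 + page2

print(is_sorted_desc(all_rows, metric_index=0))  # first metric: 3, 3, 3, 2, 2
print(is_sorted_desc(all_rows, metric_index=2))  # third metric: 1, 2, 4, 3, 2
```

On these rows the first metric is non-increasing across both pages, while the third is not. That would be consistent with a sort on the first metric only, where rows that tie on it are not guaranteed any particular order, though the rows alone cannot confirm how the API sorts.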