SOLVED

Can paging requests be processed in parallel by the API?

Level 2

Hey,

I'm trying to send multiple requests against a single report suite to page through a report using the "TOP" and "STARTINGWITH" params. Is it possible to send these paging requests in parallel? Will the API prepare the reports sequentially, or also in parallel?

My main concern is cardinality. The estimated row count can reach 1M, so I plan to send multiple requests in parallel instead of sequentially. Would this reduce my total download time?

6 Replies

Former Community Member

In our implementation, the server seems to process up to 8 queued reports at a time. The queue seems to be in order of request receipt.

Say you have a baseline list of args, and that you know the 'top' parameter ahead of time. Note the reporting API has a 50K limit for 'top.'

List of 9
 $ reportsuite.id  : chr "myID"
 $ date.from       : chr "2016-09-01"
 $ date.to         : chr "2016-09-02"
 $ elements        : chr "my_element"
 $ metrics         : chr [1:2] "pageviews" "visits"
 $ segment.id      : chr [1:2] "segment1" "segment2"
 $ date.granularity: chr "month"
 $ top             : int 100000
 $ enqueueOnly     : logi TRUE

You could then chunk this up into say 4 pieces:

List of 4
 $ chunk1:List of 10
  ..$ reportsuite.id  : chr "myID"
  ..$ date.from       : chr "2016-09-01"
  ..$ date.to         : chr "2016-09-02"
  ..$ elements        : chr "my_element"
  ..$ metrics         : chr [1:2] "pageviews" "visits"
  ..$ segment.id      : chr [1:2] "segment1" "segment2"
  ..$ date.granularity: chr "month"
  ..$ top             : num 25000
  ..$ enqueueOnly     : logi TRUE
  ..$ start           : int 1
 $ chunk2:List of 10
  ..$ reportsuite.id  : chr "myID"
  ..$ date.from       : chr "2016-09-01"
  ..$ date.to         : chr "2016-09-02"
  ..$ elements        : chr "my_element"
  ..$ metrics         : chr [1:2] "pageviews" "visits"
  ..$ segment.id      : chr [1:2] "segment1" "segment2"
  ..$ date.granularity: chr "month"
  ..$ top             : num 25000
  ..$ enqueueOnly     : logi TRUE
  ..$ start           : int 25001
 $ chunk3:List of 10
  ..$ reportsuite.id  : chr "myID"
  ..$ date.from       : chr "2016-09-01"
  ..$ date.to         : chr "2016-09-02"
  ..$ elements        : chr "my_element"
  ..$ metrics         : chr [1:2] "pageviews" "visits"
  ..$ segment.id      : chr [1:2] "segment1" "segment2"
  ..$ date.granularity: chr "month"
  ..$ top             : num 25000
  ..$ enqueueOnly     : logi TRUE
  ..$ start           : int 50001
 $ chunk4:List of 10
  ..$ reportsuite.id  : chr "myID"
  ..$ date.from       : chr "2016-09-01"
  ..$ date.to         : chr "2016-09-02"
  ..$ elements        : chr "my_element"
  ..$ metrics         : chr [1:2] "pageviews" "visits"
  ..$ segment.id      : chr [1:2] "segment1" "segment2"
  ..$ date.granularity: chr "month"
  ..$ top             : num 25000
  ..$ enqueueOnly     : logi TRUE
  ..$ start           : int 75001
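The chunked list above can be generated rather than written by hand. Here is a minimal sketch in Python (the listings in this thread are R output; Python is used only for illustration). The function name `chunk_report_args` and the dict keys are hypothetical stand-ins that mirror the argument names shown above, not part of any API.

```python
from typing import Any, Dict, List

MAX_TOP = 50_000  # per-request limit for 'top' mentioned earlier in this thread


def chunk_report_args(base_args: Dict[str, Any], chunk_size: int) -> List[Dict[str, Any]]:
    """Split one request into several that differ only in 'start' and 'top'.

    Hypothetical helper: 'start' is 1-based, as in the listing above.
    """
    if not 0 < chunk_size <= MAX_TOP:
        raise ValueError(f"chunk_size must be between 1 and {MAX_TOP}")
    total = base_args["top"]
    chunks = []
    for start in range(1, total + 1, chunk_size):
        args = dict(base_args)  # copy so the baseline args stay untouched
        args["start"] = start
        # The final chunk may be smaller than chunk_size.
        args["top"] = min(chunk_size, total - start + 1)
        chunks.append(args)
    return chunks


base = {
    "reportsuite.id": "myID",
    "date.from": "2016-09-01",
    "date.to": "2016-09-02",
    "elements": "my_element",
    "metrics": ["pageviews", "visits"],
    "top": 100_000,
    "enqueueOnly": True,
}
chunks = chunk_report_args(base, 25_000)
print([(c["start"], c["top"]) for c in chunks])
```

With the baseline above this yields four chunks starting at rows 1, 25001, 50001 and 75001, each with a top of 25000.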

With enqueueOnly == TRUE, you get back report IDs, which you can then pull down with Report.Get().

Level 2

I'm using the Analytics API directly. Is there a param to set enqueueOnly == TRUE, or is this set by default? I'm using Report.Queue() to get the reportId, and then Report.Get().

So does this roughly mean the following? Originally I sent 1 single request and got the entire report after 4 mins. If I now split it into 4 requests and send them to the API queue, and the queue contains only these 4 requests, can I get all 4 responses in about 1 min (as each request contains only 1/4 of the data, and they are processed in parallel)?

One more follow-up question: when I split the request into four and send the pieces sequentially, is there a chance that some of the chunks become inconsistent because new data comes in? For example, at time 0 I request the first chunk, at time 1 I request the second chunk, and at time 3 the total number of rows increases from 100,000 to 100,001. Will that newly added row always land in the last chunk?

Former Community Member

Yes, queuing with Report.Queue() and then fetching with Report.Get() is the default workflow. The library I use to access the API strings Report.Queue() and Report.Get() together into a single function, and its enqueueOnly argument gives the option of using that "default" workflow instead. Sorry for not making this clear.

Pretty much. Informally, I've noticed a 2-6x improvement in throughput, depending on the nature of the request.

You can give yourself a bit of headroom in your initial top value. I often make a call to determine the number of rows using the Row Count calculated metric for each report that I am going to chunk up. I'm not sure if there is a more elegant way, but there's a bit more context here:

DISTINCT COUNT of element values? · Issue #186 · randyzwitch/RSiteCatalyst · GitHub

In any event, you could add e.g. 10% to the number of expected rows to be safe. This way, if your last chunk has fewer rows than specified by its top arg, you know that you've captured all the data.
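The padding idea can be sketched like this in Python. Both function names are hypothetical, and the 10% default simply mirrors the example figure above.

```python
def padded_top(expected_rows: int, margin_percent: int = 10) -> int:
    """Add a safety margin to the expected row count, rounding up.

    Integer arithmetic is used to avoid floating-point surprises.
    """
    extra = (expected_rows * margin_percent + 99) // 100  # ceiling division by 100
    return expected_rows + extra


def captured_everything(rows_in_last_chunk: int, top_of_last_chunk: int) -> bool:
    """A last chunk that comes back short means no rows are left beyond it."""
    return rows_in_last_chunk < top_of_last_chunk


total_top = padded_top(100_000)
print(total_top)  # 110000
```

If the last chunk comes back full instead, the row estimate was too low and it is worth requesting one more chunk.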

Level 2

Great, thanks a lot!

So to get the row count, I need to save a calculated metric that computes the row count, and then use that number to make my report requests, right?

Once I shard the request into several chunks, can I always assume that there will be NO overlapping data between the sharded requests?

Correct answer by
Former Community Member

Yep, you just make a simple CM (calculated metric) using row count and use that to make your request to get the top value. Note that you will need to include at least one non-calc metric, e.g. visits, to actually pull any data. Also note that to get the expected result, you can only have a single breakdown, as you want to get the row count for the single breakdown dimension. This means that you want a report that does not have a date breakdown built in. I just use a Ranked report.

Re: overlapping data: I almost always end the time frame at the current day minus one, and have only had reason to use Ranked reports. Given that I'm looking at historic (stable) data, and each request is identical with the exception of the start and top params, I can't think of any scenario where overlap should be possible on my end. You can (and should) test your approach by comparing a chunked report set against a single call. I definitely did this when starting out.
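One possible way to run that comparison, sketched in Python. The row shape (element name plus a tuple of metric counts) and the function name `compare_chunked_to_single` are assumptions for illustration; adapt them to whatever your client returns.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[str, Tuple[int, ...]]  # assumed shape: (element name, metric counts)


def compare_chunked_to_single(chunked_pages: List[List[Row]], single_report: List[Row]) -> Dict[str, object]:
    """Report overlaps, gaps, and whether the combined chunks equal the single call."""
    combined = [row for page in chunked_pages for row in page]
    counts = Counter(name for name, _ in combined)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    missing = sorted({name for name, _ in single_report} - set(counts))
    return {
        "duplicates": duplicates,  # element names that appear in more than one chunk
        "missing": missing,        # element names in the single call but in no chunk
        "identical": combined == single_report,
    }


single = [("a", (3, 0, 1)), ("b", (3, 0, 2)), ("c", (2, 0, 3))]
clean = compare_chunked_to_single([single[:2], single[2:]], single)
overlapping = compare_chunked_to_single([single[:2], single[1:]], single)
print(clean)
print(overlapping)
```

The `identical` check is order-sensitive, so it also catches chunks whose rows come back in a different order than in the single call.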

It's also helpful to write a helper function to compartmentalize the chunking logic, plus general query construction helpers, tryCatch recovery, logging, etc., depending on how complicated your queues get. I think I read that DW (Data Warehouse) queues have a 72-hour time frame for retrieval, although I don't know if that is the same for the reporting API.
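As an illustration of the recovery and logging helpers mentioned above, here is a small Python analogue of a tryCatch wrapper. The name `with_retries` and its parameters are hypothetical, and the retry count and delay are arbitrary choices, not recommendations from the API documentation.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("report_chunks")


def with_retries(func: Callable[..., Any], *args: Any,
                 attempts: int = 3, delay_seconds: float = 0.0, **kwargs: Any) -> Any:
    """Call func, logging each failure and retrying up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            time.sleep(delay_seconds)


# Demo with a function that fails twice before succeeding.
calls = {"n": 0}


def flaky_get(report_id: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("report not ready")
    return f"rows for {report_id}"


result = with_retries(flaky_get, "r1", attempts=5)
print(result, calls["n"])
```

Wrapping each fetch this way keeps one failed chunk from aborting the whole queue, while still raising once the retries are exhausted.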

Level 2

Hey, sorry,

Can I ask one more question here? Thanks!

I'm trying to use the "start" and "top" params for paging, but when I select multiple metrics (> 1) in the report, every page except the last looks NOT sorted, while the last page looks sorted.

For example (sorted by the first metric):

1. First page (top = 3, startingWith = 0)

Got response:

count: [3, 0, 1]; [3, 0, 2]; [3, 0, 4];

2. Second page (top = 3, startingWith = 3), which is the last page

Got response:

count: [2, 0, 3]; [2, 0, 2].
