2
votes

We have developed an importing solution for one of our clients. It parses data contained in many OneNote notebooks and converts it to the proprietary data structures the client requires, so the client can store and use it within another information system.

There is a substantial amount of data across many notebooks, so a considerable number of Graph API queries is needed to retrieve all of it.

In essence, we built a bulk-importing solution (a batch process, essentially) that goes through all OneNote notebooks under a client's account, parses the section and page data of each, and downloads and stores all page content, including linked documents and images. The linked documents and images account for most of the Graph API queries.

When performing these imports, we run into Graph API throttling. After a certain time, even though we are sending queries at a relatively low rate, we start getting 429 errors.

Regarding data volume, the average section of a client notebook holds 50-70 pages. Each page links to about 5 documents for download, on average. Retrieving all page content and files of a single notebook section therefore takes up to 70 + 350 = 420 requests. Our client has many such sections in a notebook, and in turn there are many notebooks.

In total, there are approximately 150 such sections across several notebooks that we need to import for our client. Given the figures above, we estimate that our import needs to make a total of 60,000-65,000 Graph API queries.
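As a rough check of the estimate above, the arithmetic can be sketched as follows. The figures come from the question; the variable and function names are illustrative only, and the calls that list the pages of each section are left out because they are few by comparison.

```python
# Figures taken from the question; names are illustrative assumptions.
PAGES_PER_SECTION = (50, 70)   # low and high estimate of pages per section
FILES_PER_PAGE = 5             # average linked documents/images per page
SECTIONS = 150                 # sections to import in total
REQUESTS_PER_HOUR = 900        # one request every 4 seconds


def requests_per_section(pages: int) -> int:
    """One request per page for its content, plus one per linked file."""
    return pages + pages * FILES_PER_PAGE


low, high = (requests_per_section(p) * SECTIONS for p in PAGES_PER_SECTION)
hours_high = high / REQUESTS_PER_HOUR

print(f"total requests: {low} to {high}")
print(f"hours at {REQUESTS_PER_HOUR} requests/hour (upper end): {hours_high}")
```

With these inputs the range is 45,000 to 63,000 requests, so the 60,000-65,000 figure corresponds to the upper end of the page count, and the upper end takes about 70 hours at 900 requests per hour.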

To avoid flooding the Graph API service and to stay within the throttling limits, we have experimented a lot and gradually decreased our request rate to just one query every 4 seconds. That is, at most 900 Graph API requests are made per hour.

This already makes each section import noticeably slow - but it is endurable, even though it means that our full import would take up to 72 continuous hours to complete.
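For reference, fixed-rate pacing of the kind described above (one request every 4 seconds) could look roughly like this. It is a minimal sketch, not the actual import code; the `Pacer` class and its `clock` and `sleep` parameters are hypothetical names, injected only so the behaviour can be checked without real waiting.

```python
import time


class Pacer:
    """Allow at most one call per `interval` seconds (hypothetical helper)."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self.interval


# Usage: call pacer.wait() immediately before every Graph API request.
pacer = Pacer(interval=4.0)
```

Measuring from the start of each request rather than sleeping a flat 4 seconds after it means slow responses do not reduce the effective rate any further.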

However, even with our throttling logic implemented at this rate and proven to work, we still get 429 "too many requests" errors from the Graph API after about 1 hr 10 min, or about 1100 consecutive queries. As a result, we cannot continue the import for the remaining, unfinished notebook sections. We can only import a few sections consecutively, and then have to wait for some arbitrary time before we can manually attempt to continue the import.

So this is the problem we seek help with, especially from Microsoft representatives. Can Microsoft provide a way for us to import these 60-65K pages and documents at a reasonably fast query rate without getting throttled, so that we can get the job done in one continuous batch process for our client? For example, a separate access point (dedicated service endpoint), perhaps time-constrained, e.g. configured for our use within a certain period, so that we could perform all the necessary imports within that period?

For additional information, we currently load the data using the following Graph API URLs (placeholders for the actual values are shown in uppercase letters between curly braces):

Pages under the notebook section: https://graph.microsoft.com/v1.0/users/{USER}/onenote/sections/{SECTION_ID}/pages?...

Content of a page: https://graph.microsoft.com/v1.0/users/{USER}/onenote/pages/{PAGE_ID}/content

A file (document or image), e.g. linked from the page content: https://graph.microsoft.com/v1.0/{USER}/onenote/resources/{RESOURCE_ID}/$value

2 Answers

1
votes

Which call is most likely to cause the throttling?

What can you retrieve before throttling: just page IDs (150 calls total) or page IDs plus content (10000 calls)? If the latter, can you store the results (e.g. in a SQL database) so that you don't have to make these calls again?
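A minimal sketch of storing results so they are never fetched twice, assuming SQLite as the SQL database. The table layout and the `fetch` callable are hypothetical; `fetch` stands in for whatever function performs the Graph API request for one page.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a small cache database keyed by page ID."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (page_id TEXT PRIMARY KEY, content BLOB)"
    )
    return db


def get_page(db, page_id, fetch):
    """Return cached content, calling fetch(page_id) only on a cache miss."""
    row = db.execute(
        "SELECT content FROM pages WHERE page_id = ?", (page_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    content = fetch(page_id)
    db.execute(
        "INSERT INTO pages (page_id, content) VALUES (?, ?)", (page_id, content)
    )
    db.commit()
    return content
```

With a file path instead of `":memory:"`, an import that was interrupted by throttling can resume without repeating the calls that already succeeded.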

If you can get page IDs plus content, can you then access the resources using preAuthenticated=true? (Maybe this is less likely to be throttled.) I don't actually save images offline myself, as I usually deal with ink or print.

I find the OneNote API is very sensitive to multiple calls issued without waiting for them to complete; more than 12 simultaneous calls via a curl multi technique is problematic for me. Once you get throttled, if you don't back off immediately you can be throttled for a long, long time. I usually have my scripts bail if I get too many 429s in a row (I have it set so that after 10 simultaneous 429s it bails for 10 minutes).
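The back-off-and-bail idea can be sketched like this. It is an illustration under stated assumptions, not my actual scripts: `send` is a hypothetical callable returning `(status, headers, body)`, and the function names and default values are made up. Microsoft's throttling guidance describes a Retry-After header on 429 responses, so the sketch prefers that value when it is present and numeric, and falls back to a doubling delay otherwise.

```python
import time


class Throttled(Exception):
    """Raised when too many 429 responses arrive in a row."""


def fetch_with_backoff(send, url, max_429=10, default_wait=5.0, sleep=time.sleep):
    """Call send(url) until it stops returning 429, or give up.

    `send` is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_429):
        status, headers, body = send(url)
        if status != 429:
            return status, body
        # Prefer the server's Retry-After hint (in seconds) when usable.
        try:
            wait = float(headers.get("Retry-After", ""))
        except ValueError:
            wait = default_wait * (2 ** attempt)
        sleep(wait)
    raise Throttled(f"{max_429} consecutive 429 responses for {url}")
```

A caller can catch `Throttled` and pause the whole job (for example for 10 minutes) instead of continuing to send requests while throttled.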

1
votes

We now have the solution released and working in production. It turns out that adding ?preAuthenticated=true to the page requests does indeed return the page content with the resource links (for contained documents and images) in a different format. Querying these resource links then does not seem to count against the API throttling counters, as we've had no 429 errors since.
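For anyone wanting to try the same thing, building the page content URL with that parameter could look roughly like this. It is a sketch only: the `page_content_url` helper is a hypothetical name, the URL shape is the page content URL from the question, and percent-encoding the user and page ID is an assumption on my part.

```python
from urllib.parse import quote, urlencode

GRAPH = "https://graph.microsoft.com/v1.0"


def page_content_url(user, page_id, pre_authenticated=True):
    """Build the page content URL, optionally adding preAuthenticated=true."""
    url = (
        f"{GRAPH}/users/{quote(user, safe='')}"
        f"/onenote/pages/{quote(page_id, safe='')}/content"
    )
    if pre_authenticated:
        url += "?" + urlencode({"preAuthenticated": "true"})
    return url


print(page_content_url("USER", "PAGE_ID"))
```

The resource links are then taken from the returned page content as they are, rather than being constructed from resource IDs.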

We even managed to shorten the interval between calls from 4 seconds to 2, without any problems. So I have marked codeeye's answer as the accepted one.