3
votes

Situation

We use Cloud Storage to store large Elasticsearch results (from aggregations).

To handle these large aggregations in parallel, we store them as multiline JSON dumps.

Parallel processing therefore means that many instances open this file at once, which hits the URLFetch rate limit because of this documented limitation:

and the calls count against your URL fetch quota, as the library uses the URL Fetch service to interact with Cloud Storage.

Here's the resulting exception:

The pipeline UI gives this error

Here's the code that opens the file:

import cloudstorage as gcs

def open_file(path, mode, **kwargs):
    f = gcs.open(path, mode=mode, **kwargs)
    if not f:
        raise Exception("File could not be opened: %s" % path)

    return f

Question

We need a method of communicating with Cloud Storage that bypasses the URLFetch quotas and rate limits, or it becomes impossible for us to effectively execute parallel processing.

Is there a method of reading GCS files from App Engine that does not route through URLFetch, much like the datastore API does not incur URLFetch rate limits?

1
There is no way around URLFetch on "vanilla" App Engine due to the sandbox restrictions. Managed VMs are "exempt" from this as they have direct network access. Also, are you really hitting the billing-enabled maximum rate limit? If so, I'd suggest contacting Google support to talk about a potential quota bump. - mensi
@mensi It's not a billing quota, as far as I understand it - cloud.google.com/appengine/docs/quotas?hl=en#UrlFetch A long-term solution for us will be to move this processing to a dedicated Managed VM solution. - Josh
My "billing" reference was in relation to the fact that free apps and billing enabled apps get different quotas. An app with billing enabled should be able to sustain 740 MB/min or 120 requests per minute. As I mentioned before, if your application requires more you should get in touch with Google Cloud Support - mensi

1 Answer

1
votes

Not sure if such an approach is compatible/usable with your application, but here goes...

Instead of funneling the results directly to the GCS file during parallel aggregation processing, you could use the GAE datastore (which has more relaxed quotas) to store the intermediate aggregation results. Once the aggregation is complete, assemble the final result if needed and ship it to GCS in a single request, or just a few.
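A rough sketch of that flow, to make the idea concrete. Everything named here is hypothetical: InMemoryStore is only a stand-in for the datastore (on App Engine it would presumably be ndb entities keyed by shard), and write_fn stands in for whatever performs the one GCS write (for example, a wrapper around gcs.open(path, 'w') and a single f.write). The point is only that the workers never touch GCS and the final payload goes out in one call.

```python
import json


class InMemoryStore(object):
    """Hypothetical stand-in for the datastore: shard id -> list of result dicts."""

    def __init__(self):
        self._shards = {}

    def put(self, shard_id, rows):
        self._shards[shard_id] = list(rows)

    def shards_in_order(self):
        # Yield each shard's rows in ascending shard id order.
        for shard_id in sorted(self._shards):
            yield self._shards[shard_id]


def save_partial(store, shard_id, rows):
    """Called by each parallel worker; touches only the intermediate store."""
    store.put(shard_id, rows)


def assemble_ndjson(store):
    """Join every stored row into one multiline (newline-delimited) JSON payload."""
    lines = []
    for rows in store.shards_in_order():
        for row in rows:
            lines.append(json.dumps(row, sort_keys=True))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


def ship_final(store, write_fn):
    """Assemble the payload and hand it to write_fn exactly once."""
    payload = assemble_ndjson(store)
    write_fn(payload)
    return payload


if __name__ == "__main__":
    store = InMemoryStore()
    save_partial(store, 1, [{"bucket": "b", "count": 2}])
    save_partial(store, 0, [{"bucket": "a", "count": 1}])
    written = []
    ship_final(store, written.append)  # one write, however many shards were stored
    print(written[0])
```

Whether this helps depends on where the quota is actually being consumed: it cuts the number of GCS writes to one, but any later step that has many instances read the same GCS file would still go through URLFetch.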