
UPDATE: (5/18/2020) Solution at the end of this post!

I'm attempting to upload large CSV files (30 MB - 2 GB) from a browser to GCP App Engine running Python 3.7 + Flask, and then push those files to GCP Storage. This works fine in local testing, even with large files, but on GCP any file larger than roughly 20 MB immediately fails with "413 - Your client issued a request that was too large". The error happens instantly on upload, before the request ever reaches my Python code (I suspect App Engine is checking the Content-Length header). After lots of SO/blog research I tried many solutions, to no avail. Note that I'm using the basic/free App Engine setup with an F1 instance running the Gunicorn server.

First, following an SO post, I tried setting app.config['MAX_CONTENT_LENGTH'] = 2147483648, but that didn't change anything. My app still threw the error before the request even reached my Python code:

    # main.py
    import flask
    from google.cloud import storage

    app = flask.Flask(__name__)
    app.config['MAX_CONTENT_LENGTH'] = 2147483648   # 2 GB limit

    @app.route('/upload', methods=['POST', 'GET'])   # matches the form's action below
    def upload():
        # COULDN'T GET THIS FAR WITH A LARGE UPLOAD!!!
        if flask.request.method == 'POST':

            uploaded_file = flask.request.files.get('file')

            storage_client = storage.Client()
            storage_bucket = storage_client.get_bucket('my_uploads')

            blob = storage_bucket.blob(uploaded_file.filename)
            blob.upload_from_string(uploaded_file.read())

        return flask.render_template('index.html')   # assumes index.html is in templates/

    <!-- index.html -->
    <form method="POST" action="/upload" enctype="multipart/form-data">
        <input type="file" name="file">
        <input type="submit" value="Upload">
    </form>
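
As an aside, MAX_CONTENT_LENGTH only governs Flask/Werkzeug's own request parsing: locally an oversized upload raises Werkzeug's RequestEntityTooLarge, which you can at least catch, whereas on App Engine the 413 comes back from the front end before Flask ever sees the request. A minimal sketch of such a handler, just to show where that limit is actually enforced:

    # Sketch only: catches the 413 Werkzeug raises locally when
    # MAX_CONTENT_LENGTH is exceeded. It never fires on App Engine because the
    # front end rejects the oversized request before it reaches Flask.
    from werkzeug.exceptions import RequestEntityTooLarge

    @app.errorhandler(RequestEntityTooLarge)
    def handle_too_large(e):
        return 'File too large for this server to accept', 413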

After further research, I switched to chunked uploads with Flask-Dropzone, hoping I could upload the data in batches and then append to / build up the CSV file as a Storage blob:

    # main.py
    import flask
    from flask_dropzone import Dropzone
    from google.cloud import storage

    app = flask.Flask(__name__)
    app.config['MAX_CONTENT_LENGTH'] = 2147483648   # 2 GB limit
    dropzone = Dropzone(app)


    @app.route('/upload', methods=['POST', 'GET'])
    def upload():

        if flask.request.method == 'POST':

            uploaded_file = flask.request.files.get('file')

            storage_client = storage.Client()
            storage_bucket = storage_client.get_bucket('my_uploads')

            CHUNK_SIZE = 10485760  # 10 MB
            blob = storage_bucket.blob(uploaded_file.filename, chunk_size=CHUNK_SIZE)

            # hoping for a create-if-not-exists then append thereafter
            blob.upload_from_string(uploaded_file.read())

        return flask.render_template('index.html')

And the JS/HTML is straight from a few samples I found online:

    <script>
        Dropzone.options.myDropzone = {
            timeout: 300000,
            chunking: true,
            chunkSize: 10485760
        };
    </script>
    ....
    <form method="POST" action="/upload" class="dropzone dz-clickable"
          id="dropper" enctype="multipart/form-data">
    </form>

The above does upload in chunks (I can see repeated calls to POST /upload), but the call to blob.upload_from_string(uploaded_file.read()) just keeps replacing the blob's contents with the last chunk uploaded instead of appending. It doesn't work even if I strip out the chunk_size=CHUNK_SIZE parameter.

Next I looked at writing to /tmp and then uploading to Storage, but the docs say /tmp is backed by the instance's memory (of which I have very little), and the rest of the filesystem is read-only, so neither option will work.

Is there an append API or approved methodology to upload big files to GCP App Engine and push/stream to Storage? Given the code works on my local server (and happily uploads to GCP Storage), I'm assuming this is a built-in limitation in App Engine that needs to be worked around.


SOLUTION (5/18/2020): I was able to use Flask-Dropzone to have JavaScript split the upload into many 10 MB chunks and send those chunks to the Python server one at a time. On the Python side, each chunk is appended to a file in /tmp to "build up" the contents until all chunks have arrived. On the last chunk, the assembled file is uploaded to GCP Storage and the /tmp file is deleted.

    # main.py
    import os

    import flask
    from google.cloud import storage

    app = flask.Flask(__name__)
    storage_client = storage.Client()


    @app.route('/upload', methods=['POST'])
    def upload():

        uploaded_file = flask.request.files.get('file')

        # append this chunk to a temp file until the whole upload has arrived
        tmp_file_path = '/tmp/' + uploaded_file.filename
        with open(tmp_file_path, 'a') as f:
            f.write(uploaded_file.read().decode("UTF8"))

        # Dropzone sends the chunk index and total chunk count as form fields
        chunk_index = int(flask.request.form.get('dzchunkindex', 0))
        chunk_count = int(flask.request.form.get('dztotalchunkcount', 1))

        # last chunk: push the assembled file to Storage, then clean up
        if chunk_index == (chunk_count - 1):
            print('Saving file to Storage')
            storage_bucket = storage_client.get_bucket('prairi_uploads')
            blob = storage_bucket.blob(uploaded_file.filename)  # CHUNK??

            blob.upload_from_filename(tmp_file_path, client=storage_client)
            print('Saved to Storage')

            print('Deleting temp file')
            os.remove(tmp_file_path)

        return 'OK'
    <!-- index.html -->
    <script>
        Dropzone.options.myDropzone = {
            ... // configs
            timeout: 300000,
            chunking: true,
            chunkSize: 1000000
        };
    </script>

Note that /tmp shares resources with RAM, so you need at least as much RAM as the uploaded file size, plus more for Python itself (I had to use an F4 instance). I imagine there's a better solution that writes to block storage instead of /tmp, but I haven't gotten that far yet.


1 Answer


The answer is that you cannot upload or download files larger than 32 MB to App Engine in a single HTTP request. Source

You either need to redesign your service to transfer data in multiple HTTP requests, transfer data directly to Cloud Storage using signed URLs, or select a different service that does NOT use the Google Front End (GFE), such as Compute Engine. This rules out services such as Cloud Functions, Cloud Run, and App Engine flexible.
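
For the signed URL option, App Engine only hands the browser a short-lived upload URL and the file itself goes straight to Cloud Storage, so the 32 MB request limit never applies to the file. A minimal sketch using the google-cloud-storage client (the bucket name, expiry, and content type below are placeholder assumptions):

    # Sketch: endpoint that returns a V4 signed URL; the browser then PUTs the
    # file directly to Cloud Storage. The bucket name, 15-minute expiry and
    # text/csv content type are placeholder assumptions.
    import datetime

    import flask
    from google.cloud import storage

    app = flask.Flask(__name__)
    storage_client = storage.Client()

    @app.route('/signed-url/<filename>')
    def signed_url(filename):
        blob = storage_client.bucket('my_uploads').blob(filename)
        url = blob.generate_signed_url(
            version='v4',
            expiration=datetime.timedelta(minutes=15),
            method='PUT',
            content_type='text/csv',
        )
        return {'url': url}

One caveat: generating a signed URL requires credentials that can sign (a service account key, or the IAM signBlob API), which may need extra setup on App Engine's default service account.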

If you use multiple HTTP requests, you will need to manage memory as all temporary files are stored in memory. This means you will have issues as you approach the maximum instance size of 2 GB.
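
If you stay with multiple HTTP requests instead, one way to avoid building the whole file up in instance memory is to upload each chunk as its own Cloud Storage object and then stitch the pieces together with the compose operation once the last chunk arrives. A rough sketch (the ".part<i>" naming is just an assumed convention, and compose() accepts at most 32 source objects per call, so very large files need staged composes):

    # Sketch: assumes each chunk was already uploaded as '<filename>.part<i>'.
    # This composes the parts into the final object and deletes them. The
    # naming convention is an assumption; compose() takes at most 32 sources
    # per call.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket('my_uploads')

    def finish_upload(filename, chunk_count):
        parts = [bucket.blob(f'{filename}.part{i}') for i in range(chunk_count)]
        final_blob = bucket.blob(filename)
        final_blob.compose(parts)      # concatenates the parts in order
        for part in parts:
            part.delete()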