UPDATE: (5/18/2020) Solution at the end of this post!
I'm attempting to upload large CSV files (30 MB - 2 GB) from a browser to GCP App Engine running Python 3.7 + Flask, and then push those files to GCP Storage. This works fine in local testing with large files, but on GCP it errors out immediately with "413 - Your client issued a request that was too large" if the file is larger than roughly 20 MB. The error happens instantly on upload, before the request even reaches my Python logic (I suspect App Engine is checking the Content-Length header). I tried many solutions after lots of SO/blog research, to no avail. Note that I am using the basic/free App Engine setup with an F1 instance running the Gunicorn server.
First, I tried setting `app.config['MAX_CONTENT_LENGTH'] = 2147483648`, but that didn't change anything (SO post). My app still threw an error before it even reached my Python code:
```python
# main.py
import flask
from google.cloud import storage

app = flask.Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 2147483648  # 2 GB limit

@app.route('/upload', methods=['POST', 'GET'])
def upload():
    # COULDN'T GET THIS FAR WITH A LARGE UPLOAD!!!
    if flask.request.method == 'POST':
        uploaded_file = flask.request.files.get('file')
        storage_client = storage.Client()
        storage_bucket = storage_client.get_bucket('my_uploads')
        blob = storage_bucket.blob(uploaded_file.filename)
        blob.upload_from_string(uploaded_file.read())
    return 'OK'
```
```html
<!-- index.html -->
<form method="POST" action="/upload" enctype="multipart/form-data">
  <input type="file" name="file">
  <input type="submit" value="Upload">
</form>
```
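For context on why `MAX_CONTENT_LENGTH` never helps here: App Engine standard's front end enforces a documented 32 MB cap on inbound request bodies and rejects oversized requests before they reach the app at all, so any in-app setting is moot. A tiny pre-flight guard (hypothetical helper, stdlib only) makes the constraint concrete:

```python
import os

# App Engine standard rejects request bodies over 32 MB at the front end,
# before Flask (and MAX_CONTENT_LENGTH) ever see the request.
GAE_REQUEST_LIMIT = 32 * 1024 * 1024

def needs_chunking(path, limit=GAE_REQUEST_LIMIT):
    """True if the file can't be sent to App Engine in a single request."""
    return os.path.getsize(path) > limit
```

In practice multipart encoding adds overhead on top of the raw file size, which may be why uploads started failing for me somewhere around 20 MB rather than exactly at 32 MB.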
After further research, I switched to chunked uploads with Flask-Dropzone, hoping I could upload the data in batches and then append/build up the CSV file as a Storage blob:
```python
# main.py
import flask
from flask_dropzone import Dropzone
from google.cloud import storage

app = flask.Flask(__name__)
app.config['MAX_CONTENT_LENGTH'] = 2147483648  # 2 GB limit
dropzone = Dropzone(app)

@app.route('/upload', methods=['POST', 'GET'])
def upload():
    if flask.request.method == 'POST':
        uploaded_file = flask.request.files.get('file')
        storage_client = storage.Client()
        storage_bucket = storage_client.get_bucket('my_uploads')

        CHUNK_SIZE = 10485760  # 10 MB
        blob = storage_bucket.blob(uploaded_file.filename, chunk_size=self.CHUNK_SIZE)

        # hoping for a create-if-not-exists then append thereafter
        blob.upload_from_string(uploaded_file.read())
    return 'OK'
```
And the JS/HTML is straight from a few samples I found online:
```html
<script>
  Dropzone.options.myDropzone = {
    timeout: 300000,
    chunking: true,
    chunkSize: 10485760
  };
</script>
...
<form method="POST" action="/upload" class="dropzone dz-clickable"
      id="dropper" enctype="multipart/form-data">
</form>
```
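Each chunked POST that Dropzone sends carries bookkeeping form fields (`dzuuid`, `dzchunkindex`, `dztotalchunkcount`, among others) that the server needs in order to reassemble the file. A minimal stdlib model of that bookkeeping (the field names are Dropzone's; the plain dict stands in for `flask.request.form`):

```python
def chunk_info(form):
    """Extract Dropzone's chunk bookkeeping from a form-like mapping.

    Returns (index, count, is_last). Non-chunked uploads omit the
    fields entirely, so they default to a single chunk.
    """
    index = int(form.get('dzchunkindex', 0))
    count = int(form.get('dztotalchunkcount', 1))
    return index, count, index == count - 1
```

This is the piece my first Dropzone attempt was missing: without reading these fields, every chunk looks like a complete standalone upload.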
The above does upload in chunks (I can see repeated calls to `POST /upload`), but the call to `blob.upload_from_string(uploaded_file.read())` just keeps replacing the blob contents with the last chunk uploaded instead of appending. This doesn't work even if I strip out the `chunk_size=self.CHUNK_SIZE` parameter.
Next I looked at writing to /tmp and then to Storage, but the docs say writing to /tmp eats into the little memory I have (it's a RAM-backed filesystem), and the filesystem elsewhere is read-only, so neither of those will work.
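To make the /tmp constraint concrete: on App Engine standard, /tmp is backed by the instance's RAM, so a scratch file competes with the Python process for memory. A rough feasibility check; the per-class RAM figures are my reading of the docs at the time and may have changed, so treat them as assumptions:

```python
# Approximate RAM per App Engine standard instance class, in MB
# (assumed values from the docs; verify against current GCP pricing pages).
INSTANCE_RAM_MB = {'F1': 256, 'F2': 512, 'F4': 1024, 'F4_1G': 2048}

def fits_in_tmp(file_mb, instance_class, python_overhead_mb=128):
    """Rough check: can /tmp hold the file with headroom left for Python?"""
    return file_mb + python_overhead_mb <= INSTANCE_RAM_MB[instance_class]
```

On an F1 this rules out anything much beyond ~100 MB, let alone a 2 GB CSV.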
Is there an append API or approved methodology to upload big files to GCP App Engine and push/stream to Storage? Given the code works on my local server (and happily uploads to GCP Storage), I'm assuming this is a built-in limitation in App Engine that needs to be worked around.
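(One partial answer I came across: Cloud Storage has no true append API, but its compose operation can concatenate up to 32 existing objects into a new one, which can emulate append-by-chunks: upload each chunk as its own object, then compose. A hedged sketch; the bucket/object names are illustrative, the storage import is deferred so the batching helper stands alone, and I haven't run the GCS call myself:)

```python
COMPOSE_LIMIT = 32  # GCS compose accepts at most 32 source objects per call

def compose_batches(part_names):
    """Split uploaded chunk-object names into compose-sized batches."""
    return [part_names[i:i + COMPOSE_LIMIT]
            for i in range(0, len(part_names), COMPOSE_LIMIT)]

def assemble_in_gcs(bucket_name, final_name, part_names):
    """Stitch up to 32 chunk objects into one blob with a single compose."""
    from google.cloud import storage  # deferred so compose_batches stays standalone
    bucket = storage.Client().get_bucket(bucket_name)
    dest = bucket.blob(final_name)
    # More than 32 parts would need iterative compose calls, batched as above.
    dest.compose([bucket.blob(name) for name in part_names[:COMPOSE_LIMIT]])
```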
SOLUTION (5/18/2020): I was able to use Flask-Dropzone to have JavaScript split the upload into many small chunks and send them to the Python server one at a time. On the Python side, we keep appending to a file in /tmp to "build up" the contents until all chunks arrive. On the last chunk, we upload the assembled file to GCP Storage and then delete the /tmp file.
```python
# main.py
import os
import flask
from google.cloud import storage

app = flask.Flask(__name__)
storage_client = storage.Client()

@app.route('/upload', methods=['POST'])
def upload():
    uploaded_file = flask.request.files.get('file')

    # Append this chunk to a scratch file in /tmp. Binary mode, so a
    # multi-byte character split across a chunk boundary can't be mangled.
    tmp_file_path = '/tmp/' + uploaded_file.filename
    with open(tmp_file_path, 'ab') as f:
        f.write(uploaded_file.read())

    # Dropzone sends its chunk bookkeeping as form fields; non-chunked
    # uploads omit them, so default to a single chunk.
    chunk_index = int(flask.request.form.get('dzchunkindex', 0))
    chunk_count = int(flask.request.form.get('dztotalchunkcount', 1))

    if chunk_index == chunk_count - 1:
        print('Saving file to Storage')
        storage_bucket = storage_client.get_bucket('prairi_uploads')
        blob = storage_bucket.blob(uploaded_file.filename)
        blob.upload_from_filename(tmp_file_path, client=storage_client)
        print('Saved to Storage')

        print('Deleting temp file')
        os.remove(tmp_file_path)

    return 'OK'
```
```html
<!-- index.html -->
<script>
  Dropzone.options.myDropzone = {
    // ... other configs
    timeout: 300000,
    chunking: true,
    chunkSize: 1000000
  };
</script>
```
Note that /tmp shares resources with RAM, so you need at least as much RAM as the uploaded file size, plus headroom for Python itself (I had to use an F4 instance). I would imagine there's a better solution that writes to block storage instead of /tmp, but I haven't gotten that far yet.
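A sketch of what that better solution might look like: newer versions of google-cloud-storage let you open a blob as a writable file object via `blob.open('wb')`, which streams through a resumable upload instead of staging the whole file in /tmp. The generic writer below takes any factory returning a file-like sink, so the chunk-forwarding logic is testable without GCP credentials; the GCS wiring in the docstring is an assumption, not something I've run:

```python
def stream_chunks(chunks, open_sink):
    """Write an iterable of byte chunks to a sink; return total bytes written.

    In production, open_sink could be something like
        lambda: bucket.blob(name).open('wb')
    (google-cloud-storage's file-like blob writer), so chunks stream
    straight into a resumable upload with no /tmp staging at all.
    """
    total = 0
    with open_sink() as sink:
        for chunk in chunks:
            sink.write(chunk)
            total += len(chunk)
    return total
```

The catch for the Dropzone flow above is that each chunk arrives as a separate HTTP request, so the open writer would need to survive across requests (or chunks would need to arrive strictly in order on one instance), which is exactly the part I haven't solved yet.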