3
votes

A memory leak is detected via memory_profiler when uploading a large file to Google Cloud Storage. Since such a big file will be uploaded from a 128 MB GCF instance or an f1-micro GCE instance, how can I prevent this memory leakage?

✗ python -m memory_profiler tests/test_gcp_storage.py
67108864

Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.586 MiB   35.586 MiB   @profile
    49                             def test_upload_big_file():
    50   35.586 MiB    0.000 MiB     from google.cloud import storage
    51   35.609 MiB    0.023 MiB     client = storage.Client()
    52                             
    53   35.609 MiB    0.000 MiB     m_bytes = 64
    54   35.609 MiB    0.000 MiB     filename = int(datetime.utcnow().timestamp())
    55   35.609 MiB    0.000 MiB     blob_name = f'test/{filename}'
    56   35.609 MiB    0.000 MiB     bucket_name = 'my_bucket'
    57   38.613 MiB    3.004 MiB     bucket = client.get_bucket(bucket_name)
    58                             
    59   38.613 MiB    0.000 MiB     with open(f'/tmp/{filename}', 'wb+') as file_obj:
    60   38.613 MiB    0.000 MiB       file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.613 MiB    0.000 MiB       file_obj.write(b'\0')
    62   38.613 MiB    0.000 MiB       file_obj.seek(0)
    63                             
    64   38.613 MiB    0.000 MiB       blob = bucket.blob(blob_name)
    65  102.707 MiB   64.094 MiB       blob.upload_from_file(file_obj)
    66                             
    67  102.715 MiB    0.008 MiB     blob = bucket.get_blob(blob_name)
    68  102.719 MiB    0.004 MiB     print(blob.size)

Moreover, if the file is not opened in binary mode, the memory growth is roughly twice the file size.

67108864
Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.410 MiB   35.410 MiB   @profile
    49                             def test_upload_big_file():
    50   35.410 MiB    0.000 MiB     from google.cloud import storage
    51   35.441 MiB    0.031 MiB     client = storage.Client()
    52                             
    53   35.441 MiB    0.000 MiB     m_bytes = 64
    54   35.441 MiB    0.000 MiB     filename = int(datetime.utcnow().timestamp())
    55   35.441 MiB    0.000 MiB     blob_name = f'test/{filename}'
    56   35.441 MiB    0.000 MiB     bucket_name = 'my_bucket'
    57   38.512 MiB    3.070 MiB     bucket = client.get_bucket(bucket_name)
    58                             
    59   38.512 MiB    0.000 MiB     with open(f'/tmp/{filename}', 'w+') as file_obj:
    60   38.512 MiB    0.000 MiB       file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.512 MiB    0.000 MiB       file_obj.write('\0')
    62   38.512 MiB    0.000 MiB       file_obj.seek(0)
    63                             
    64   38.512 MiB    0.000 MiB       blob = bucket.blob(blob_name)
    65  152.250 MiB  113.738 MiB       blob.upload_from_file(file_obj)
    66                             
    67  152.699 MiB    0.449 MiB     blob = bucket.get_blob(blob_name)
    68  152.703 MiB    0.004 MiB     print(blob.size)

GIST: https://gist.github.com/northtree/8b560a6b552a975640ec406c9f701731

Once blob goes out of scope, is the memory still in use? - Maximilian
I have tried your code (both the binary and non-binary way) and both gave me the same file size. Using memory_profiler I didn't get any memory increment when uploading the blob in either version. Try deleting the blob after uploading it (del blob), or try the "upload_from_filename" method to see if you face the same issue -> googleapis.github.io/google-cloud-python/latest/storage/… . Let me know. - Mayeru
@Maximilian I suppose the blob should be automatically released outside the with block. - northtree
@Mayeru I have run it multiple times with Python 3.7 and google-cloud-storage==1.16.1 on OS X. Are you running in a different environment? Thanks. - northtree
Some advice on how to write code for the cloud: 1) You do not have a memory leak unless you have code that is not displayed. 2) You do not want to allocate large blocks of memory to read a file into. 128 MB is big - too big. 3) Internet connections fail, timeout, packets get dropped, have errors, so you want to upload in smaller blocks like 64 KB or 1 MB per I/O with retry logic. 4) Performance is increased by multi-part uploads. Typically, two to four threads will double the performance. I realize that your question is "memory leaks" but write good code and then quality check the good code. - John Hanley
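
For reference, a minimal sketch of the upload_from_filename / del blob variant suggested in the comments above; the bucket name, object name, and local path are placeholders, not values from the question:

# Sketch only: same kind of setup as the question, with placeholder names.
from datetime import datetime
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')                    # placeholder bucket name
blob_name = f'test/{int(datetime.utcnow().timestamp())}'

blob = bucket.blob(blob_name)
# Let the client open and read the file itself instead of passing an open file object.
blob.upload_from_filename('/tmp/my_big_file')              # placeholder local path
del blob                                                   # drop the reference once the upload completes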

1 Answer

0
votes

To limit the amount of memory used during an upload, explicitly set a chunk size on the blob before calling upload_from_file(). With a chunk size configured, the client performs a resumable upload in chunks of that size instead of buffering the entire file in memory:

blob = bucket.blob(blob_name, chunk_size=10 * 1024 * 1024)  # chunk_size must be a multiple of 256 KB
blob.upload_from_file(file_obj)

I agree this is bad default behaviour of the Google client SDK, and the workaround is badly documented as well.
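
Putting it together, here is a minimal, self-contained sketch of the workaround, assuming the same setup as the question; the bucket name, object name, and local path are placeholders:

from google.cloud import storage

CHUNK_SIZE = 10 * 1024 * 1024   # must be a multiple of 256 KB

client = storage.Client()
bucket = client.get_bucket('my_bucket')                        # placeholder bucket name
blob = bucket.blob('test/my_big_file', chunk_size=CHUNK_SIZE)  # placeholder object name

# Open in binary mode; as the profiles above show, text mode roughly doubles the memory used.
with open('/tmp/my_big_file', 'rb') as file_obj:               # placeholder local path
    # With chunk_size set, the client streams the upload in CHUNK_SIZE pieces
    # instead of reading the entire file into memory at once.
    blob.upload_from_file(file_obj)

On a 128 MB Cloud Function, a 10 MiB chunk keeps the upload's working set well below the memory limit; smaller chunks trade more request overhead for less memory.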