3
votes

A memory leak is detected via memory_profiler when uploading a large file to Google Cloud Storage. Since such a big file will be uploaded from a 128 MB GCF instance or an f1-micro GCE instance, how can I prevent this memory leakage?

✗ python -m memory_profiler tests/test_gcp_storage.py
67108864

Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.586 MiB   35.586 MiB   @profile
    49                             def test_upload_big_file():
    50   35.586 MiB    0.000 MiB     from google.cloud import storage
    51   35.609 MiB    0.023 MiB     client = storage.Client()
    52                             
    53   35.609 MiB    0.000 MiB     m_bytes = 64
    54   35.609 MiB    0.000 MiB     filename = int(datetime.utcnow().timestamp())
    55   35.609 MiB    0.000 MiB     blob_name = f'test/{filename}'
    56   35.609 MiB    0.000 MiB     bucket_name = 'my_bucket'
    57   38.613 MiB    3.004 MiB     bucket = client.get_bucket(bucket_name)
    58                             
    59   38.613 MiB    0.000 MiB     with open(f'/tmp/{filename}', 'wb+') as file_obj:
    60   38.613 MiB    0.000 MiB       file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.613 MiB    0.000 MiB       file_obj.write(b'\0')
    62   38.613 MiB    0.000 MiB       file_obj.seek(0)
    63                             
    64   38.613 MiB    0.000 MiB       blob = bucket.blob(blob_name)
    65  102.707 MiB   64.094 MiB       blob.upload_from_file(file_obj)
    66                             
    67  102.715 MiB    0.008 MiB     blob = bucket.get_blob(blob_name)
    68  102.719 MiB    0.004 MiB     print(blob.size)

Moreover, if the file is not opened in binary mode, the memory growth is roughly twice the file size.

67108864
Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.410 MiB   35.410 MiB   @profile
    49                             def test_upload_big_file():
    50   35.410 MiB    0.000 MiB     from google.cloud import storage
    51   35.441 MiB    0.031 MiB     client = storage.Client()
    52                             
    53   35.441 MiB    0.000 MiB     m_bytes = 64
    54   35.441 MiB    0.000 MiB     filename = int(datetime.utcnow().timestamp())
    55   35.441 MiB    0.000 MiB     blob_name = f'test/{filename}'
    56   35.441 MiB    0.000 MiB     bucket_name = 'my_bucket'
    57   38.512 MiB    3.070 MiB     bucket = client.get_bucket(bucket_name)
    58                             
    59   38.512 MiB    0.000 MiB     with open(f'/tmp/{filename}', 'w+') as file_obj:
    60   38.512 MiB    0.000 MiB       file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.512 MiB    0.000 MiB       file_obj.write('\0')
    62   38.512 MiB    0.000 MiB       file_obj.seek(0)
    63                             
    64   38.512 MiB    0.000 MiB       blob = bucket.blob(blob_name)
    65  152.250 MiB  113.738 MiB       blob.upload_from_file(file_obj)
    66                             
    67  152.699 MiB    0.449 MiB     blob = bucket.get_blob(blob_name)
    68  152.703 MiB    0.004 MiB     print(blob.size)

GIST: https://gist.github.com/northtree/8b560a6b552a975640ec406c9f701731

Once blob goes out of scope, is the memory still in use? - Maximilian
I have tried your code (both the binary and non-binary way) and both gave me the same file size. Using memory_profiler I didn't get any memory increment when uploading the blob in either version. Try deleting the blob after uploading it (del blob), or try the "upload_from_filename" method to see if you face the same issue -> googleapis.github.io/google-cloud-python/latest/storage/… . Let me know. - Mayeru
@Maximilian I suppose the blob should be automatically released outside the with block. - northtree
@Mayeru I have run it multiple times with Python 3.7 and google-cloud-storage==1.16.1 on OS X. Are you running in a different environment? Thanks. - northtree
Some advice on how to write code for the cloud: 1) You do not have a memory leak unless you have code that is not displayed. 2) You do not want to allocate large blocks of memory to read a file into. 128 MB is big - too big. 3) Internet connections fail, timeout, packets get dropped, have errors, so you want to upload in smaller blocks like 64 KB or 1 MB per I/O with retry logic. 4) Performance is increased by multi-part uploads. Typically, two to four threads will double the performance. I realize that your question is "memory leaks" but write good code and then quality check the good code. - John Hanley
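
For reference, a minimal sketch of the upload_from_filename / del blob variant suggested in the comments above; the bucket name, object name, and local path are placeholders, not values from the question:

# Sketch only: same kind of setup as the question, with placeholder names.
from datetime import datetime
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')                    # placeholder bucket name
blob_name = f'test/{int(datetime.utcnow().timestamp())}'

blob = bucket.blob(blob_name)
# Let the client open and read the file itself instead of passing an open file object.
blob.upload_from_filename('/tmp/my_big_file')              # placeholder local path
del blob                                                   # drop the reference once the upload completes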

1 Answer

0
votes

To limit the amount of memory used during an upload, explicitly set a chunk size on the blob before calling upload_from_file(). With a chunk size configured, the client performs a resumable upload in chunks of that size instead of buffering the entire file in memory:

blob = bucket.blob(blob_name, chunk_size=10 * 1024 * 1024)  # chunk_size must be a multiple of 256 KB
blob.upload_from_file(file_obj)

I agree this is bad default behaviour of the Google client SDK, and the workaround is badly documented as well.
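
Putting it together, here is a minimal, self-contained sketch of the workaround, assuming the same setup as the question; the bucket name, object name, and local path are placeholders:

from google.cloud import storage

CHUNK_SIZE = 10 * 1024 * 1024   # must be a multiple of 256 KB

client = storage.Client()
bucket = client.get_bucket('my_bucket')                        # placeholder bucket name
blob = bucket.blob('test/my_big_file', chunk_size=CHUNK_SIZE)  # placeholder object name

# Open in binary mode; as the profiles above show, text mode roughly doubles the memory used.
with open('/tmp/my_big_file', 'rb') as file_obj:               # placeholder local path
    # With chunk_size set, the client streams the upload in CHUNK_SIZE pieces
    # instead of reading the entire file into memory at once.
    blob.upload_from_file(file_obj)

On a 128 MB Cloud Function, a 10 MiB chunk keeps the upload's working set well below the memory limit; smaller chunks trade more request overhead for less memory.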