
I'm trying to load some large .npy files in CuPy in memory-mapped mode, but I keep running into OutOfMemoryError.

I thought that since the file is being opened in memory-mapped mode, this operation shouldn't take much memory, because a memory map doesn't actually load the whole array into memory.

I can load these files with np.load just fine; this only seems to happen with cupy.load. My environment is Google Colab with a Tesla K80 GPU: about 12 GB of CPU RAM, 12 GB of GPU RAM, and 350 GB of disk space.

Here is a minimal example to reproduce the error:

import os
import numpy as np
import cupy

# Create four ~5.12 GB .npy files (10,000,000 x 128 float32 each).
for i in range(4):
    numpyMemmap = np.memmap('reg.memmap' + str(i), dtype='float32', mode='w+', shape=(10000000, 128))
    np.save('reg.memmap' + str(i), numpyMemmap)
    del numpyMemmap
    os.remove('reg.memmap' + str(i))  # keep only the .npy copy

# Check that they load correctly with np.load in memory-mapped mode.
NPYmemmap = []
for i in range(4):
    NPYmemmap.append(np.load('reg.memmap' + str(i) + '.npy', mmap_mode='r+'))
del NPYmemmap

# Eventually results in an out-of-memory error.
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append(cupy.load('reg.memmap' + str(i) + '.npy', mmap_mode='r+'))

Output:

0
1
/usr/local/lib/python3.6/dist-packages/cupy/creation/from_data.py:41: UserWarning: Using synchronous transfer as pinned memory (5120000000 bytes) could not be allocated. This generally occurs because of insufficient host memory. The original error was: cudaErrorMemoryAllocation: out of memory
  return core.array(obj, dtype, copy, order, subok, ndmin)
2
3
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-4-b5c849e2adba> in <module>()
      2 for i in range(4):
      3     print(i)
----> 4     CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )

1 frames
/usr/local/lib/python3.6/dist-packages/cupy/io/npz.py in load(file, mmap_mode)
     47     obj = numpy.load(file, mmap_mode)
     48     if isinstance(obj, numpy.ndarray):
---> 49         return cupy.array(obj)
     50     elif isinstance(obj, numpy.lib.npyio.NpzFile):
     51         return NpzFile(obj)

/usr/local/lib/python3.6/dist-packages/cupy/creation/from_data.py in array(obj, dtype, copy, order, subok, ndmin)
     39 
     40     """
---> 41     return core.array(obj, dtype, copy, order, subok, ndmin)
     42 
     43 

cupy/core/core.pyx in cupy.core.core.array()

cupy/core/core.pyx in cupy.core.core.array()

cupy/core/core.pyx in cupy.core.core.ndarray.__init__()

cupy/cuda/memory.pyx in cupy.cuda.memory.alloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory._try_malloc()

OutOfMemoryError: out of memory to allocate 5120000000 bytes (total 20480000000 bytes)

I am also wondering if this is perhaps related to Google Colab and its environment/GPU.

For convenience, here is a Google Colab notebook with this minimal code:

https://colab.research.google.com/drive/12uPL-ZnKhGTJifZGVdTN7e8qBRRus4tA

So it sounds like pinning any file on disk will take the same amount of space in the GPU RAM as it does on the disk. So would this mean there is no advantage to having GPU memory-mapped data over loading that data right into GPU RAM? What would be the fastest method for a GPU to save to/from a disk, if the data is bigger than the GPU RAM? – SantoshGupta7
This answers my question exactly; if you submit this as an answer, I will accept it. I am trying to study more about the processes you describe. I Googled 'cuda pin memory cpu gpu ram', but none of the results mentioned (as far as I could tell) that pinning requires CPU rather than GPU memory. If there's a source that you specifically recommend for my situation, let me know. – SantoshGupta7

1 Answer


The numpy.load mechanism, when used with memory mapping, may not require the entire file to be loaded from disk into host memory.
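
You can see this directly: in memory-mapped mode, numpy.load returns a numpy.memmap, whose pages are only read from disk when they are accessed. A quick check, using the first file from your question:

import numpy as np

host_arr = np.load('reg.memmap0.npy', mmap_mode='r')
print(type(host_arr))  # <class 'numpy.memmap'>; data is paged in lazily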

However, it appears that cupy.load requires the entire file to fit first in host memory, then in device memory.

Your particular test case creates 4 disk files of ~5 GB each (10,000,000 x 128 float32 values = 5,120,000,000 bytes, which matches the allocation size in your traceback). These won't all fit in either host or device memory if you have 12 GB of each. Therefore I would expect things to fail on the 3rd file load, if not earlier.

It may be possible to use your numpy.load mechanism with memory mapping, and then selectively move portions of that data to the GPU with CuPy operations. In that case, the data size on the GPU would still be limited to GPU RAM, for the usual things like CuPy arrays.
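
For illustration, here is a minimal sketch of that approach against the files created in your question. chunk_rows is an arbitrary chunk size I chose for illustration (512 MB per chunk here), not a CuPy parameter:

import numpy as np
import cupy

chunk_rows = 1000000  # 1,000,000 x 128 float32 = 512 MB per chunk

# The array stays memory-mapped on the host; nothing is read yet.
host_arr = np.load('reg.memmap0.npy', mmap_mode='r')

for start in range(0, host_arr.shape[0], chunk_rows):
    # cupy.asarray copies only this slice into device memory.
    gpu_chunk = cupy.asarray(host_arr[start:start + chunk_rows])
    # ... run your GPU computation on gpu_chunk here ...
    del gpu_chunk  # allow CuPy's memory pool to reuse the allocation

Only one chunk at a time has to fit in GPU RAM, so the total data size is bounded by disk space rather than by either memory.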

Even if you could use CUDA pinned "zero-copy" memory, it would still be limited to the host memory size (12 GB, here) or less.
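
Regarding your comment about the fastest way to move data larger than GPU RAM to and from disk: a common pattern is to stage chunks through an explicitly pinned host buffer, since page-locked transfers run at full PCIe bandwidth and can overlap with computation. This is a sketch under the same assumptions as above (chunk_rows is again an illustrative choice), not something cupy.load does for you; note that cupy.cuda.alloc_pinned_memory draws on host RAM, so the staging buffer is bounded by your 12 GB of CPU memory:

import numpy as np
import cupy

def pinned_empty(shape, dtype):
    # Allocate page-locked host memory and view it as a numpy array.
    # This allocation comes out of host RAM, not GPU memory.
    size = int(np.prod(shape))
    mem = cupy.cuda.alloc_pinned_memory(size * np.dtype(dtype).itemsize)
    return np.frombuffer(mem, dtype, size).reshape(shape)

chunk_rows = 1000000  # illustrative chunk size (512 MB)

host_arr = np.load('reg.memmap0.npy', mmap_mode='r')
staging = pinned_empty((chunk_rows, host_arr.shape[1]), host_arr.dtype)
gpu_chunk = cupy.empty((chunk_rows, host_arr.shape[1]), dtype=host_arr.dtype)

for start in range(0, host_arr.shape[0], chunk_rows):
    n = min(chunk_rows, host_arr.shape[0] - start)
    staging[:n] = host_arr[start:start + n]  # disk -> pinned host RAM
    gpu_chunk[:n].set(staging[:n])           # pinned host RAM -> GPU
    # ... process gpu_chunk[:n] here ...

Writing results back to disk works the same way in reverse: copy the device data into the pinned buffer, then assign it into a memory-mapped output array.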