Objective/Problem
In Python, I am looking for a fast way to read/write data from a memory mapped file to a GPU.
In a previous SO overflow post [ Cupy OutOfMemoryError when trying to cupy.load larger dimension .npy files in memory map mode, but np.load works fine ]
Where it is mentioned this is possible using CUDA pinned "zero-copy" memory. Furthermore, it seems that this method was developed by this person [ cuda - Zero-copy memory, memory-mapped file ] though that person was working in C++.
My previous attempts have been with Cupy, but I am open to any cuda methods.
What I have tried so far
I mentioned how I tried to use Cupy, which allows you to open numpy files in memmory mapped mode.
import os
import numpy as np
import cupy
#Create .npy files.
for i in range(4):
numpyMemmap = np.memmap( 'reg.memmap'+str(i), dtype='float32', mode='w+', shape=( 2200000 , 512))
np.save( 'reg.memmap'+str(i) , numpyMemmap )
del numpyMemmap
os.remove( 'reg.memmap'+str(i) )
# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
NPYmemmap.append( np.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' ) )
del NPYmemmap
# Eventually results in memory error.
CPYmemmap = []
for i in range(4):
print(i)
CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' ) )
Result of what I have tried
My attempt resulting in OutOfMemoryError:
It was mentioned that
it appears that cupy.load will require that the entire file fit first in host memory, then in device memory.
And it was also mentioned that
CuPy can't handle mmap memory. So, CuPy uses GPU memory directly in default. https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.MemoryPool.html#cupy.cuda.MemoryPool.malloc You can change default memory allocator if you want to use Unified Memory.
I tried using
cupy.cuda.set_allocator(cupy.cuda.MemoryPool(cupy.cuda.memory.malloc_managed).malloc)
But this didn't seem to make a difference. At the time of the error, my CPU Ram was at ~16 gigs, but my GPU ram was at 0.32 gigs. I am using Google colab where my CPU Ram is 25 gigs and GPU ram is 12 gigs. So it looks like that after the entire file was hosted in host memory, it checked that if it could fit in device memory, and when it saw that it only has 12 out of the required 16 gigs, it threw an error (my best guess).
So, now I am trying to figure out a way to use pinned 'zero-copy' memory to handle a memory mapped file which would feed data to the GPU.
If important, the type of data I am trying to transfer are floating point arrays. Normally, for read-only data, binary files are loaded into GPU memory, but I am working with data I am try to both read and write at every step.