Finding the right chunk cache size
First, a few general remarks.
It is important to know that each individual chunk can only be read or written as a whole. The default chunk-cache size of h5py, which exists to avoid excessive disk I/O, is only 1 MB and should in many cases be increased, as discussed later on.
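For reference, the cache is configured per file when it is opened. A minimal sketch, assuming h5py >= 2.9 (which exposes the rdcc_* keywords) and a placeholder file name:

import h5py

# Raise the chunk cache from the ~1 MB default to 1 GiB and give it plenty of hash slots.
f = h5py.File('example.h5', 'a',
              rdcc_nbytes=1024**3,    # chunk-cache size in bytes (default ~1 MB)
              rdcc_nslots=1_000_000,  # hash-table slots; should be much larger than the
                                      # number of chunks that fit into the cache
              rdcc_w0=0.75)           # eviction preference (HDF5 default)
f.close()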
As an example:
- We have a dset with shape (639038, 10000), float32 (25.5 GB uncompressed)
- we want to write our data column-wise (dset[:,i] = arr) and read it row-wise (arr = dset[i,:])
- we choose a completely wrong chunk shape for this type of work, e.g. (1, 10000)
In this case the reading speed won't be too bad (although the chunk size is a little small), because we read only the data we are using. But what happens when we write to that dataset? If we access a column, one floating-point number of each chunk is written. This means we are actually writing the whole dataset (25.5 GB) with every iteration and reading the whole dataset every other time. This is because if you modify a chunk, you have to read it first if it is not cached (I assume a chunk-cache size below 25.5 GB here).
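To make those numbers concrete, a quick back-of-the-envelope calculation (plain Python, no h5py involved) for the (1, 10000) chunk shape:

rows = 639038
chunk_bytes = 1 * 10000 * 4          # one float32 chunk of shape (1, 10000) = 40 kB
chunks_touched = rows                # d[:, i] = arr writes one value into every chunk
bytes_rewritten = chunks_touched * chunk_bytes
print(bytes_rewritten / 1e9)         # ~25.6 GB read and rewritten for a single column write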
So what can we improve here?
In such a case we have to make a compromise between write/read speed and the memory which is used by the chunk-cache.
An assumption which will give both decent read and write speed:
- We choose a chunk shape of (100, 1000)
- If we want to write column-wise (iterating over the second dimension), we need at least 639038*1000*4 bytes -> 2.55 GB of cache to avoid the additional I/O overhead described above; reading row-wise (iterating over the first dimension) needs 100*10000*4 bytes -> 4 MB.
- So we should provide at least 2.6 GB of chunk cache in this example (see the small estimate sketched below).
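Here is that estimate for the (100, 1000) chunk shape (plain Python; the ceil only accounts for the last partial chunk):

import math

rows, cols = 639038, 10000
chunk_rows, chunk_cols = 100, 1000
chunk_bytes = chunk_rows * chunk_cols * 4            # 400 kB per float32 chunk

chunks_along_dim0 = math.ceil(rows / chunk_rows)     # 6391
chunks_along_dim1 = math.ceil(cols / chunk_cols)     # 10

cache_for_column_writes = chunks_along_dim0 * chunk_bytes   # ~2.56 GB, needed for d[:, i] = arr
cache_for_row_reads = chunks_along_dim1 * chunk_bytes       # 4 MB, needed for arr = d[i, :]
print(cache_for_column_writes / 1e9, cache_for_row_reads / 1e6)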
Conclusion
There is no generally right chunk size or shape; it depends heavily on the task at hand. Never choose your chunk size or shape without giving some thought to the chunk cache. RAM is orders of magnitude faster than the fastest SSD in terms of random read/write.
Regarding your problem
I would simply read the random rows; the improper chunk-cache size is your real problem.
Compare the performance of the following code with your version:
import h5py as h5
import time
import numpy as np

def ReadingAndWriting():
    File_Name_HDF5 = 'Test.h5'

    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # We are using 4 GB of chunk-cache memory here ("rdcc_nbytes")
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(0, shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Reading random rows
    # If we read one row, 100 rows (one chunk height) are actually read, but if we access a row
    # whose chunk is already in the cache we see a huge speed-up.
    f = h5.File(File_Name_HDF5, 'r', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f["Test"]
    for j in range(0, 639):
        t1 = time.time()
        # With more iterations it becomes more likely that we hit an already cached chunk
        inds = np.random.randint(0, high=shape[0], size=1000)
        for i in range(0, inds.shape[0]):
            Array = np.copy(d[inds[i], :])
        print(time.time() - t1)
    f.close()
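If you want to verify what you actually got, the chunk shape and the effective cache settings can be read back via h5py's low-level wrappers around H5Pget_cache (a small sketch, assuming the Test.h5 file written above):

import h5py as h5

f = h5.File('Test.h5', 'r', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
d = f["Test"]
print(d.chunks)                                        # chunk shape, here (100, 1000)
mdc_nelmts, nslots, nbytes, w0 = f.id.get_access_plist().get_cache()
print(nslots, nbytes, w0)                              # effective chunk-cache settings
f.close()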
The simplest form of fancy slicing
I wrote in the comments that I couldn't see this behavior in recent versions. I was wrong. Compare the following:
def Writing():
    File_Name_HDF5 = 'Test.h5'

    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # Writing_1 normal indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    t1 = time.time()
    for i in range(shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Writing_2 simplest form of fancy indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i] = Array

    f.close()
    print(time.time() - t1)
On my HDD this gives 34 seconds for the first version and 78 seconds for the second version.
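So if you have a 1-D array that you want to write into a column, the cheap workaround is to keep the sliced form and add the extra axis on the NumPy side (a sketch, assuming the Test.h5 dataset from above already exists):

import numpy as np
import h5py as h5

f = h5.File('Test.h5', 'a', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
d = f["Test"]
Array = np.random.rand(d.shape[0]).astype(np.float32)

d[:, 0] = Array                      # "simplest form of fancy indexing": the slower path here
d[:, 0:1] = Array[:, np.newaxis]     # simple slicing: about twice as fast in this test
f.close()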
dataset[i] is slower than dataset[i:i+1]? I find that surprising: do you have a reference for this? According to the h5py documentation (docs.h5py.org/en/latest/high/dataset.html#reading-writing-data), both are examples of "simple slicing." I'm going to give chunk shape (1, 10000) a go. Thanks for that idea. – jpp