3 votes

I have a reasonably sized (18 GB compressed) HDF5 dataset and am looking to optimize reading rows for speed. The shape is (639038, 10000). I will be reading a selection of rows (say ~1000 rows) many times, scattered across the dataset, so I can't use x:(x+1000) to slice rows.

Reading rows from out-of-memory HDF5 is already slow using h5py since I have to pass a sorted list and resort to fancy indexing. Is there a way to avoid fancy indexing, or is there a better chunk shape/size I can use?

I have read rules of thumb such as 1 MB-10 MB chunk sizes and choosing a chunk shape consistent with what I'm reading. However, building a large number of HDF5 files with different chunk shapes for testing is computationally expensive and very slow.

For each selection of ~1,000 rows, I immediately sum them to get an array of length 10,000. My current dataset looks like this:

'10000': {'chunks': (64, 1000),
          'compression': 'lzf',
          'compression_opts': None,
          'dtype': dtype('float32'),
          'fillvalue': 0.0,
          'maxshape': (None, 10000),
          'shape': (639038, 10000),
          'shuffle': False,
          'size': 2095412704}
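
For reference, the read pattern I am trying to speed up looks essentially like this (a minimal sketch; the file name and dataset key are placeholders for my actual data):

import h5py
import numpy as np

# Minimal sketch of the current access pattern (file name and dataset key are placeholders)
with h5py.File('data.h5', 'r') as f:
    dset = f['10000']                                   # shape (639038, 10000), float32
    rows = np.random.choice(dset.shape[0], 1000, replace=False)
    rows.sort()                                         # h5py fancy indexing needs increasing indices
    block = dset[rows, :]                               # fancy indexing over scattered rows -> slow
    result = block.sum(axis=0)                          # length-10000 array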

What I have tried already:

  • Rewriting the dataset with chunk shape (128, 10000), which I calculate to be ~5MB, is prohibitively slow.
  • I looked at dask.array to optimise, but since ~1,000 rows fit easily in memory I saw no benefit.
Some performance tests regarding chunk size with continuous reading: stackoverflow.com/a/44961222/4045774. In your case (random access) I would put each row in an extra chunk. Only whole chunks can be read or written! Another performance bottleneck is usually the very small default value for the chunk-cache size. Regarding the fancy indexing, I would try to read the data row by row without fancy indexing (accessing a contiguous 2D array). Even data=dataset[i,:] is a kind of fancy indexing; data=dataset[i:i+1,:] would be much faster. – max9111
@max9111, so dataset[i] is slower than dataset[i:i+1]? I find that surprising: do you have a reference for this? According to the h5py documentation (docs.h5py.org/en/latest/high/dataset.html#reading-writing-data), both are examples of "simple slicing". I'm going to give chunk shape (1, 10000) a go. Thanks for that idea. – jpp
Sorry, I did performance tests about 1.5 years ago and measured a huge performance drop when getting a subset of a dataset with a different number of dimensions (much like fancy indexing). It looks like that behaviour isn't there anymore. I have another question: you are reading the data row-wise. How do you write the data (assuming the data is too big to fit in your RAM)? This info is necessary for finding a good balance between read and write speed. And is there at least a moderate possibility that you read a row twice? – max9111
@max9111, No problem, it's good to hear other people are interested in HDF5. I write one line at a time, but write speed is not a concern as my use case is write once, read many times. The functions we have to run on this data mean we will be reading rows multiple times for different uses (at different times). However, some rows are often grouped together, so I'm planning on splitting the big dataset into separate groups/datasets to read into memory as much as possible. For what it's worth, 1x10000 chunks cause the file size to blow up, so it's a no-go. – jpp
I am already done with an answer... – max9111

1 Answer

9 votes

Finding the right chunk cache size

First I want to discuss some general things. It is very important to know that each individual chunk can only be read or written as a whole. The default chunk-cache size of h5py, which can avoid excessive disk I/O, is only 1 MB and should in many cases be increased, as discussed below.
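
For reference, these cache parameters are set when opening the file. A minimal sketch (not part of the original code; the 4 GB value is only an example, and recent h5py versions accept these keyword arguments directly):

import h5py

# Example: open a file with a 4 GB chunk cache instead of the 1 MB default
# rdcc_nbytes - total chunk-cache size in bytes
# rdcc_nslots - number of hash-table slots for the cache (ideally much larger than
#               the number of chunks that fit in the cache)
# rdcc_w0     - eviction policy; 1.0 evicts fully read/written chunks first
f = h5py.File('Test.h5', 'r',
              rdcc_nbytes=1024**2 * 4000,
              rdcc_nslots=int(1e7),
              rdcc_w0=1.0)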

As an example:

  • We have a dset with shape (639038, 10000), float32 (25.5 GB uncompressed).
  • We want to write our data column-wise (dset[:,i]=arr) and read it row-wise (arr=dset[i,:]).
  • We choose a completely wrong chunk shape for this type of work, i.e. (1, 10000).

In this case the reading speed won't be too bad (although the chunk size is a little small), because we only read the data we are using. But what happens when we write to that dataset? If we access a column, one floating-point number of each chunk is written. This means we actually write the whole dataset (25.5 GB) with every iteration and read the whole dataset on top of that, because a chunk that is modified has to be read first if it is not already cached (I assume a chunk-cache size below 25.5 GB here).

So what can we improve here? In such a case we have to make a compromise between write/read speed and the memory which is used by the chunk-cache.

An example that gives both decent read and write speed:

  • We choose a chunk shape of (100, 1000).
  • To write a whole column (dset[:,i]=arr) without the extra I/O overhead described above, the cache has to hold every chunk the column touches: 639038*1000*4 bytes -> about 2.55 GB. To read a whole row (arr=dset[i,:]) it only has to hold 100*10000*4 bytes -> about 4 MB (see the sketch after this list).
  • So we should provide at least 2.6 GB of chunk cache in this example.
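
Here is a small sketch of that arithmetic (a hypothetical helper, not an h5py API; it only counts the chunks touched by a single row or column access):

import math

def cache_bytes_for_slice(shape, chunks, itemsize, fixed_axis):
    # Chunk cache needed so that one access with a single index fixed along
    # `fixed_axis` (a column write dset[:, i] -> fixed_axis=1, a row read
    # dset[i, :] -> fixed_axis=0) fits in the cache. Every touched chunk is
    # cached in full.
    chunks_touched = 1
    for ax, (s, c) in enumerate(zip(shape, chunks)):
        if ax != fixed_axis:
            chunks_touched *= math.ceil(s / c)
    chunk_bytes = itemsize
    for c in chunks:
        chunk_bytes *= c
    return chunks_touched * chunk_bytes

shape, chunks = (639038, 10000), (100, 1000)
print(cache_bytes_for_slice(shape, chunks, 4, fixed_axis=1) / 1e9)  # ~2.56 GB for a column write
print(cache_bytes_for_slice(shape, chunks, 4, fixed_axis=0) / 1e6)  # ~4 MB for a row read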

Conclusion: There is no generally right chunk size or shape; it depends heavily on the task at hand. Never choose your chunk size or shape without giving some thought to the chunk cache. RAM is orders of magnitude faster than the fastest SSD when it comes to random reads and writes.

Regarding your problem, I would simply read the random rows; the improper chunk-cache size is your real problem.

Compare the performance of the following code with your version:

import h5py as h5
import time
import numpy as np

def ReadingAndWriting():
    File_Name_HDF5 = 'Test.h5'

    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # We are using 4 GB of chunk cache here ("rdcc_nbytes"); "rdcc_nslots" is an integer
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(0, shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Reading random rows
    # If we read one row, a whole chunk (100 rows) is actually read, but if we access
    # a row whose chunk is already in the cache we see a huge speed-up.
    f = h5.File(File_Name_HDF5, 'r', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f["Test"]
    for j in range(0, 639):
        t1 = time.time()
        # With more iterations it becomes more likely that we hit an already cached chunk
        inds = np.random.randint(0, high=shape[0] - 1, size=1000)
        for i in range(0, inds.shape[0]):
            Array = np.copy(d[inds[i], :])
        print(time.time() - t1)
    f.close()

ReadingAndWriting()
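
A side note on rdcc_nslots: the HDF5 documentation for H5Pset_chunk_cache recommends a prime number roughly 100 times the number of chunks that fit in rdcc_nbytes. With 0.4 MB chunks and a 4 GB cache that is about 10,000 chunks, so the 1e7 slots used above leave plenty of headroom.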

The simplest form of fancy slicing

I wrote in the comments that I couldn't see this behavior in recent versions. I was wrong. Compare the following:

def Writing():
    File_Name_HDF5 = 'Test.h5'

    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # Writing_1: normal indexing
    ###########################################
    # (4 GB chunk cache, same settings as in the example above)
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    t1 = time.time()
    for i in range(shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Writing_2: simplest form of fancy indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i] = Array

    f.close()
    print(time.time() - t1)

Writing()

On my HDD this gives 34 seconds for the first version and 78 seconds for the second version.