
I have a huge list of NumPy arrays, 113287 of them, each of shape 36 x 2048. In memory, this amounts to about 32 gigabytes.

So far, I have serialized these arrays as one giant HDF5 file. The problem is that retrieving an individual array from this HDF5 file takes an excruciatingly long time (north of 10 minutes) for each access.

How can I speed this up? This is very important for my implementation since I have to index into this list several thousand times for feeding into Deep Neural Networks.

Here's how I index into the HDF5 file:

In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')

In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'

In [6]: group_key = list(hf.keys())[0]

In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">

# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)

Any ideas where I can speed things up? Is there any other way of serializing these arrays for faster access?

Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve in the same order as I put it when I created the hdf5 file)

You can put your list in a big 3d array, in which case memory mapping might help. I have no idea about practical implications and speed, hence only a comment.
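A minimal sketch of the memory-mapping idea from this comment, assuming the features are stored as a plain `.npy` file instead of HDF5. The file name, the small demo shape, and the fill values are hypothetical stand-ins for the real (113287, 36, 2048) data.

```python
import os
import tempfile

import numpy as np

# Hypothetical small stand-in for the real (113287, 36, 2048) array.
n_samples, rows, cols = 50, 36, 2048
path = os.path.join(tempfile.mkdtemp(), "img_feats.npy")

# Write once: open_memmap creates a .npy file that can be filled
# sample by sample without holding everything in RAM.
out = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float32, shape=(n_samples, rows, cols)
)
for i in range(n_samples):
    out[i] = i  # demo data: sample i is filled with the value i
out.flush()
del out

# Read: mmap_mode="r" maps the file; data is only read when indexed.
feats = np.load(path, mmap_mode="r")
sample = np.asarray(feats[42])  # copies just one (36, 2048) sample
```

The order of the samples is simply their position along the first axis, so no Python list is needed to preserve it.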
You are using a chunked dataset. In most cases you need to adapt the chunk-cache size to get reasonable performance. It would also be good to adapt the chunk size to your access pattern. Getting a few hundred MB/s shouldn't be a problem if your sequential disk I/O is fast enough. stackoverflow.com/a/48405220/4045774
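A hedged sketch of what adapting the chunk shape and chunk cache could look like in h5py. The demo file name, the small dataset size, and the cache numbers are assumptions chosen for illustration, not tuned values; one chunk per sample is just one plausible choice when whole samples are read at a time.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical small stand-in for the real dataset.
n_samples, rows, cols = 32, 36, 2048
path = os.path.join(tempfile.mkdtemp(), "demo_feats.hdf5")

# Write with one chunk per sample, so reading ds[i] touches one chunk.
with h5py.File(path, "w") as f:
    ds = f.create_dataset(
        "img_feats",
        shape=(n_samples, rows, cols),
        dtype="f4",
        chunks=(1, rows, cols),
    )
    for i in range(n_samples):
        ds[i] = np.full((rows, cols), i, dtype="f4")

# Read with a larger chunk cache (64 MiB here, an arbitrary demo value).
with h5py.File(path, "r", rdcc_nbytes=64 * 1024**2, rdcc_nslots=10007) as f:
    ds = f["img_feats"]
    chunk_shape = ds.chunks  # chunk layout chosen at creation time
    sample = ds[10]          # reads a single (36, 2048) sample
```

The chunk shape is fixed when the dataset is created, so changing it for an existing file means rewriting the data into a new dataset.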

According to Out[7], "img_feats" is a single large 3d dataset of shape (113287, 36, 2048), not a collection of separate arrays.

Define ds as the dataset (doesn't load anything):

ds = hf[group_key]

x = ds[0]      # reads one (36, 2048) array from the file

arr = ds[:]    # loads the whole dataset into memory (about 32 GB here)
arr = ds[:n]   # loads only the first n subarrays (n is any int you choose)

According to the h5py documentation on reading and writing data:

HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.

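I don't see any point in wrapping that in list(); that is, in splitting the 3d array into a list of 113287 2d arrays. Doing so iterates over the dataset and reads every subarray from disk just to pick one. There's a clean mapping between 3d datasets in the HDF5 file and numpy arrays, and the order along the first axis is preserved.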
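To make the difference concrete, here is a small sketch that indexes the dataset directly instead of calling list() on it. The demo file and its size are hypothetical; the dataset name "img_feats" is taken from the question.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical small file with the same layout as in the question.
n_samples, rows, cols = 32, 36, 2048
path = os.path.join(tempfile.mkdtemp(), "train_ids_demo.hdf5")
with h5py.File(path, "w") as f:
    data = np.arange(n_samples, dtype="f4")[:, None, None] * np.ones(
        (1, rows, cols), dtype="f4"
    )
    f.create_dataset("img_feats", data=data)

with h5py.File(path, "r") as hf:
    ds = hf["img_feats"]  # a handle only; nothing is loaded yet
    last = ds[-1]         # reads just the last (36, 2048) subarray
    batch = ds[8:16]      # reads 8 consecutive subarrays as one slice
    count = len(ds)       # number of subarrays, no data read
```

Here `ds[-1]` plays the role of `list(hf[group_key])[-1]` from the question, but only one subarray is read from disk.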

The h5py documentation on fancy indexing warns that fancy indexing of a dataset is slower. That applies when loading scattered subarrays, say indices [1, 1000, 3000, 6000], of that large dataset in a single call.
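A sketch of reading a scattered batch, assuming a small hypothetical demo file. h5py expects an index list in increasing order, so the sketch sorts the indices, reads once, and then restores the requested order. Whether this is faster than a loop of single `ds[i]` reads depends on the file layout, so it is worth timing both.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical small file; sample i is filled with the value i.
n_samples, rows, cols = 32, 36, 2048
path = os.path.join(tempfile.mkdtemp(), "fancy_demo.hdf5")
with h5py.File(path, "w") as f:
    ds = f.create_dataset("img_feats", shape=(n_samples, rows, cols), dtype="f4")
    for i in range(n_samples):
        ds[i] = np.full((rows, cols), i, dtype="f4")

wanted = np.array([30, 1, 20, 6])  # indices in the order the model needs
order = np.argsort(wanted)         # permutation that sorts them
sorted_idx = wanted[order].tolist()

with h5py.File(path, "r") as hf:
    ds = hf["img_feats"]
    block = ds[sorted_idx]         # one fancy-indexed read, increasing indices
    # Alternative: one plain read per index, in the requested order.
    looped = np.stack([ds[int(i)] for i in wanted])

# Put the rows of the fancy-indexed read back into the requested order.
batch = np.empty_like(block)
batch[order] = block
```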

You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.


One way would be to put each sample into its own group and index directly into those. I am thinking the conversion takes long because it tries to load the entire data set into a list (which it has to read from disk). Re-organizing the h5 file as

  • group
    • sample
      • 36 x 2048

may help in indexing speed.
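A minimal sketch of that layout, with one dataset per sample inside a single group. The group name "samples", the zero-padded naming scheme, and the demo sizes are assumptions for illustration. With over a hundred thousand samples, the per-dataset overhead may become noticeable, so this is worth benchmarking against plain `ds[i]` indexing on the 3d dataset.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical small demo: 16 samples instead of 113287.
n_samples, rows, cols = 16, 36, 2048
path = os.path.join(tempfile.mkdtemp(), "per_sample_demo.hdf5")

with h5py.File(path, "w") as f:
    grp = f.create_group("samples")
    for i in range(n_samples):
        # Zero-padded names keep name order equal to insertion order.
        grp.create_dataset(f"{i:06d}", data=np.full((rows, cols), i, dtype="f4"))

with h5py.File(path, "r") as f:
    grp = f["samples"]
    sample = grp["000007"][()]  # reads only this one (36, 2048) array
    names = sorted(grp.keys())  # sample names in their original order
```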