I have a huge list of numpy arrays, specifically 113287
, where each array is of shape 36 x 2048
. In terms of memory, this amounts to 32 Gigabytes.
As of now, I have serialized these arrays as a giant HDF5 file. Now, the problem is that retrieving individual arrays from this hdf5 file takes excruciatingly long time (north of 10 mins) for each access.
How can I speed this up? This is very important for my implementation since I have to index into this list several thousand times for feeding into Deep Neural Networks.
Here's how I index into hdf5 file:
In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')
In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'
In [6]: group_key = list(hf.keys())[0]
In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">
# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)
Any ideas where I can speed things up? Is there any other way of serializing these arrays for faster access?
Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve in the same order as I put it when I created the hdf5 file)