0 votes

I am new to using HDF5 files and I am trying to read files with shapes of (20670, 224, 224, 3). Whenever I try to store the results from the HDF5 file into a list or another data structure, it either takes so long that I abort the execution or it crashes my computer. I need to be able to read 3 sets of HDF5 files, use their data, manipulate it, use it to train a CNN model, and make predictions.

Any help for reading and using these large HDF5 files would be greatly appreciated.

Currently, this is how I am reading the HDF5 file:

db = h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5")
training_db = list(db['data'])
you should try chunking the data for faster IO. – Vignesh Pillay
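A minimal sketch of what that suggestion could look like when the dataset is first written with h5py; the file name, dtype, chunk shape, and compression below are only examples, not the asker's actual pipeline:

import h5py
import numpy as np

# write the dataset with one image per chunk, so reading a single image
# only has to touch (and decompress) one chunk
with h5py.File("Training_Dataset_chunked.hdf5", "w") as f:
    dset = f.create_dataset("data", shape=(20670, 224, 224, 3), dtype=np.uint8,
                            chunks=(1, 224, 224, 3), compression="gzip")
    # dset[i, ...] = image_i  # fill image by image here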

3 Answers

1 vote

Crashes probably mean you are running out of memory. As Vignesh Pillay suggested, I would try chunking the data and working on a small piece of it at a time. If you are using the pandas method read_hdf, you can use the iterator and chunksize parameters to control the chunking:

import pandas as pd
data_iter = pd.read_hdf('/tmp/test.hdf', key='test_key', iterator=True, chunksize=100)
for chunk in data_iter:
    # train CNN on chunk here
    print(chunk.shape)

Note that this requires the HDF file to be in table format.
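A minimal sketch of writing data in that format with pandas; the path and key match the snippet above, while the DataFrame contents are placeholders. The default format='fixed' does not support chunked reads, and this route assumes tabular (2-D) data rather than the 4-D image arrays in the question:

import numpy as np
import pandas as pd

# placeholder DataFrame; only format='table' supports iterator/chunksize reads
df = pd.DataFrame(np.random.rand(1000, 4), columns=list('abcd'))
df.to_hdf('/tmp/test.hdf', key='test_key', format='table')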

1 vote

My answer was updated 2020-08-03 to reflect the code you added to your question. As @Tober noted, you are running out of memory. Reading a dataset of shape (20670, 224, 224, 3) into a list will create roughly 3.1 billion elements. If you read 3 image sets, it will require even more RAM. I assume this is image data (maybe 20670 images of shape (224, 224, 3))? If so, you can read the data in slices with both h5py and tables (PyTables). This will return the data as a NumPy array, which you can use directly (there is no need to manipulate it into a different data structure).

The basic process would look like this:

with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    training_db = db['data']
    # loop to get images one by one
    for icnt in range(20670):
        image_arr = training_db[icnt, :, :, :]
        # then do something with the image

You could also read multiple images at a time by setting the first index to a range (say icnt:icnt+100) and then adjusting the loop accordingly, as in the sketch below.
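A sketch of that batched variant, reusing the file and dataset names from the snippet above; the batch size of 100 is only an example:

import os
import h5py

batch_size = 100
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    training_db = db['data']
    n_images = training_db.shape[0]
    for icnt in range(0, n_images, batch_size):
        # each slice comes back as a NumPy array of shape (<=100, 224, 224, 3)
        batch = training_db[icnt:icnt + batch_size, :, :, :]
        # train / predict on this batch here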

0 votes

Your problem arises because you are running out of memory, so virtual datasets come in handy when dealing with large datasets like yours. Virtual datasets allow a number of real datasets to be mapped together into a single, sliceable dataset via an interface layer. You can read more about them here: https://docs.h5py.org/en/stable/vds.html

I would recommend starting with one file at a time. First, create a virtual dataset file from your existing data, like this:

import numpy as np

with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    data_shape = db['data'].shape
    layout = h5py.VirtualLayout(shape=data_shape, dtype=np.uint8)
    vsource = h5py.VirtualSource(db['data'])
    layout[...] = vsource  # map the source dataset into the virtual layout
    with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'w', libver='latest') as file:
        file.create_virtual_dataset('data', layout, fillvalue=0)

This will create a virtual dataset of your existing training data. Now, if you want to manipulate your data, you should open the file in r+ mode, like this:

with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r+', libver='latest') as file:
    # do whatever manipulation you want to do here

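What goes inside that block depends on your pipeline; here is a hedged sketch of one possible manipulation (normalising a batch of images in memory), where the slice bounds and the scaling are only illustrative:

import os
import h5py
import numpy as np

with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r+', libver='latest') as file:
    data = file['data']
    # pull one batch out as a NumPy array and work on it in memory
    batch = data[0:100, :, :, :].astype(np.float32) / 255.0
    # ... feed `batch` to your model, compute statistics, etc.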
One more thing I would like to advise: make sure the indices you use while slicing are of int datatype, otherwise you will get an error.
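For instance, an index produced by a floating-point calculation (such as an 80/20 split point) has to be cast before slicing; a small illustrative sketch using the virtual dataset created above:

import os
import h5py

with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r') as file:
    data = file['data']
    split = data.shape[0] * 0.8      # a float, e.g. 16536.0
    # data[:split] would raise an error because the index is not an integer
    train = data[:int(split)]
    val = data[int(split):]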