I have two HDF5 files with an identical structure, each storing a matrix of the same shape. I need to create a third HDF5 file containing the element-wise sum of those two matrices. Given that the matrices are extremely large (in the GB to TB range), what is the best way to do this, preferably in parallel? I am using the h5py interface to the HDF5 library. Are there any libraries capable of doing this?

1 Answer

Yes, this is possible. The key is to read a slice of the data from file1 and file2, compute the element-wise sum, then write the result to the same slice in file3, so that only one slice from each file is in memory at a time. You can do this with h5py or PyTables (aka tables); no other libraries are required. I only have passing knowledge of parallel computing, but I know h5py supports an MPI interface through the mpi4py Python package. Details are in the h5py docs, under "Parallel HDF5".
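
For the parallel case, one way to divide the work is to give each MPI rank its own contiguous range of slices. The sketch below only shows that arithmetic; `rank_bounds` is a hypothetical helper, not part of h5py or mpi4py, and the "ranks" are simulated with a plain loop so it runs without MPI.

```python
def rank_bounds(n_items, n_ranks, rank):
    """Return (start, stop) of the contiguous index range owned by `rank`.

    The first `n_items % n_ranks` ranks get one extra item, so the ranges
    cover 0..n_items with no gaps and no overlap.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Simulate 4 ranks sharing 10 slices (no MPI needed to check the arithmetic).
for rank in range(4):
    print(rank, rank_bounds(10, 4, rank))
```

With an MPI-enabled build of h5py, each process would presumably open the files with `driver='mpio'` and `comm=MPI.COMM_WORLD`, take `rank` and `n_ranks` from the communicator, and loop only over its own range. I have not benchmarked this, so treat it as a starting point.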

Here is a simple example. It creates 2 files, each with a dataset of random floats, shape=(10,10,10). It then creates a new file with an empty dataset of the same shape. The loop reads a slice of data from file1 and file2, sums them, then writes the result to the same slice in file3. To test with large data, modify the shapes to match your files.
21-Jan-2021 Update:
I added code to get the dataset shapes from file1 and file2 and compare them (to be sure they are equal). If the shapes aren't equal, I exit. If they match, I create the new file, then create a dataset of matching shape. (If you really want to be robust, you could do the same with the dtype.) I also use the value of shape[2] as the loop bound when slicing the dataset.

import h5py
import numpy as np
import sys

arr = np.random.random(10**3).reshape(10,10,10)
with h5py.File('file1.h5','w') as h5fw :
    h5fw.create_dataset('data_1',data=arr)

arr = np.random.random(10**3).reshape(10,10,10)
with h5py.File('file2.h5','w') as h5fw :
    h5fw.create_dataset('data_2',data=arr)

h5fr1 = h5py.File('file1.h5','r')
f1shape = h5fr1['data_1'].shape
h5fr2 = h5py.File('file2.h5','r')
f2shape = h5fr2['data_2'].shape

if f1shape != f2shape:
    print('Dataset shapes do not match')
    h5fr1.close()
    h5fr2.close()
    sys.exit('Exiting due to error.')

else:
    with h5py.File('file3.h5','w') as h5fw :
        # Use the source dtype; dtype='f' would store float32 and lose precision.
        ds3 = h5fw.create_dataset('data_3', shape=f1shape, dtype=h5fr1['data_1'].dtype)
    
        for i in range(f1shape[2]):
            arr1_slice = h5fr1['data_1'][:,:,i]
            arr2_slice = h5fr2['data_2'][:,:,i]
            arr3_slice = arr1_slice + arr2_slice
            ds3[:,:,i] = arr3_slice
        
        # Alternatively, slice and sum in one line inside the loop:
        #     ds3[:,:,i] = h5fr1['data_1'][:,:,i] + \
        #                  h5fr2['data_2'][:,:,i]
            
    print ('Done.')

h5fr1.close()
h5fr2.close()
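
For very large data, it may be faster to process several rows at a time along the first axis, since HDF5 stores contiguous datasets in row-major (C) order. Below is a minimal sketch of that idea. `add_blockwise` and `block_rows` are hypothetical names, and the demo uses small in-memory NumPy arrays as stand-ins; h5py datasets expose the same `.shape` and slicing interface, so the same function should accept them, but I have only checked it on arrays here.

```python
import numpy as np


def add_blockwise(src1, src2, dst, block_rows=1024):
    """Write src1 + src2 into dst, block_rows rows (axis 0) at a time.

    Works on objects that expose .shape and NumPy-style slicing,
    which includes NumPy arrays and h5py datasets.
    """
    if src1.shape != src2.shape or src1.shape != dst.shape:
        raise ValueError("all three datasets must have the same shape")
    n_rows = src1.shape[0]
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        # Only this block of each input is held in memory at one time.
        dst[start:stop] = src1[start:stop] + src2[start:stop]


# Small in-memory stand-ins for the three datasets (hypothetical data).
a = np.arange(24, dtype="f8").reshape(4, 3, 2)
b = np.ones((4, 3, 2), dtype="f8")
out = np.empty_like(a)
add_blockwise(a, b, out, block_rows=3)  # the last block is shorter (1 row)
print(np.array_equal(out, a + b))
```

Pick `block_rows` so that one block from each input fits comfortably in RAM. If your datasets are chunked, a block size that is a multiple of the chunk size along axis 0 is probably a sensible choice.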