3 votes

I have several HDF5 files that each contain the same two datasets, data and labels. These datasets are multidimensional arrays, and the first dimension is the same for both.

I would like to combine the HDF5 files into one file, and I think the best way would be to create a virtual dataset, [h5py reference], [HDF5 tutorial in C++]. However, I have not found any example using Python and h5py.

Is there any alternative to the virtual dataset, or do you know of any example using h5py?
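For context, the straightforward alternative would be to copy everything into a new file, which is exactly what I would like to avoid. A rough sketch of that (file names are placeholders):

import h5py
import numpy as np

files = ['1.h5', '2.h5', '3.h5']  # placeholder file names

# Brute-force merge: read every source file and copy the arrays into one
# new file. Works, but duplicates all the data on disk.
with h5py.File('combined.h5', 'w') as out:
    for key in ('data', 'labels'):
        arrays = []
        for name in files:
            with h5py.File(name, 'r') as f:
                arrays.append(f[key][...])  # load the whole dataset into memory
        out.create_dataset(key, data=np.concatenate(arrays, axis=0))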


3 Answers

3 votes

This is an old question but anyway...

Full support for virtual datasets has only just arrived in h5py 2.9, released on 20 Dec 2018.

They have this example of creating a virtual dataset: https://github.com/h5py/h5py/blob/master/examples/vds_simple.py
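In case that link moves, the example is roughly along these lines (a paraphrase, not a verbatim copy): it writes four small source files and then stacks their data datasets as the rows of one 2D virtual dataset.

import h5py
import numpy as np

# create the source files, each with a 1D 'data' dataset of 100 integers
for n in range(4):
    with h5py.File('{}.h5'.format(n), 'w') as f:
        f.create_dataset('data', (100,), 'i4', np.arange(100) + n)

# describe how the sources map into the virtual dataset: one row per file
layout = h5py.VirtualLayout(shape=(4, 100), dtype='i4')
for n in range(4):
    vsource = h5py.VirtualSource('{}.h5'.format(n), 'data', shape=(100,))
    layout[n] = vsource

# write the virtual dataset into its own file
with h5py.File('VDS.h5', 'w', libver='latest') as f:
    f.create_virtual_dataset('vdata', layout, fillvalue=-1)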

I also did some experimenting with concatenating the datasets that the example creates. The snippet below just produces a 1D array.

import h5py
import numpy as np

file_names_to_concatenate = ['1.h5', '2.h5', '3.h5', '4.h5']
entry_key = 'data' # where the data is inside of the source files.

sources = []
total_length = 0
# collect a VirtualSource per file and add up the length along the first axis
for filename in file_names_to_concatenate:
    with h5py.File(filename, 'r') as activeData:
        vsource = h5py.VirtualSource(activeData[entry_key])
        total_length += vsource.shape[0]
        sources.append(vsource)

# lay the sources end to end along the single axis of the virtual dataset
layout = h5py.VirtualLayout(shape=(total_length,),
                            dtype=np.float64)

offset = 0
for vsource in sources:
    length = vsource.shape[0]
    layout[offset : offset + length] = vsource
    offset += length

with h5py.File("VDS_con.h5", 'w', libver='latest') as f:
    f.create_virtual_dataset(entry_key, layout, fillvalue=0)
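The question is about multi-dimensional arrays, so here is a sketch of the same idea extended to concatenation along the first axis with the trailing axes kept. It is not from the original experiment (the helper name is mine) and it assumes every file shares the trailing shape and dtype:

import h5py

def concatenate_vds(file_names, entry_key, out_file, out_key):
    sources = []
    total_length = 0
    tail_shape = None
    dtype = None
    # collect one VirtualSource per file and total up the first axis
    for filename in file_names:
        with h5py.File(filename, 'r') as f:
            dset = f[entry_key]
            vsource = h5py.VirtualSource(dset)
            total_length += vsource.shape[0]
            tail_shape = vsource.shape[1:]  # assumed identical across files
            dtype = dset.dtype
            sources.append(vsource)

    # concatenate along axis 0, keep the remaining axes unchanged
    layout = h5py.VirtualLayout(shape=(total_length,) + tail_shape, dtype=dtype)
    offset = 0
    for vsource in sources:
        length = vsource.shape[0]
        layout[offset : offset + length, ...] = vsource
        offset += length

    with h5py.File(out_file, 'a', libver='latest') as f:
        f.create_virtual_dataset(out_key, layout, fillvalue=0)

# e.g. one virtual dataset per key mentioned in the question:
# concatenate_vds(['1.h5', '2.h5'], 'data', 'VDS_con.h5', 'data')
# concatenate_vds(['1.h5', '2.h5'], 'labels', 'VDS_con.h5', 'labels')

Reading VDS_con.h5 afterwards then looks like reading an ordinary file with data and labels datasets, but nothing is copied on disk.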

0 votes

Give the GDAL virtual format (VRT) a try.
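To expand on that a little: GDAL can expose HDF5 datasets as subdatasets and stitch rasters together into a virtual .vrt file. A rough sketch with the osgeo Python bindings (file and dataset names are placeholders, and it is not clear this suits plain, non-georeferenced arrays):

from osgeo import gdal

# GDAL addresses HDF5 datasets as subdatasets, e.g. HDF5:"1.h5"://data;
# this requires GDAL built with the HDF5 driver.
sources = ['HDF5:"1.h5"://data', 'HDF5:"2.h5"://data']

# separate=True puts each source into its own band instead of mosaicking them
vrt = gdal.BuildVRT('combined.vrt', sources, separate=True)
vrt = None  # de-reference to flush the .vrt file to disk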

0 votes

Someone has tried it. The example is here, but unfortunately I was not able to get it to work, and the snippet in those docs is also syntactically incorrect (a tidied version is below). It appears to target an early, experimental VDS API (VirtualTarget/VirtualMap) rather than the VirtualSource/VirtualLayout API that eventually landed in h5py 2.9. https://github.com/aaron-parsons/h5py/blob/1e467f6db3df23688e90f44bde7558bde7173a5b/docs/vds.rst#using-the-vds-feature-from-h5py

import h5py
import numpy as np

file_names_to_concatenate = ['1.h5', '2.h5', '3.h5', '4.h5', '5.h5']
entry_key = 'data'  # where the data is inside of the source files
num_projections = len(file_names_to_concatenate)

# shape of the dataset in the first file; assumed identical in the others
sh = h5py.File(file_names_to_concatenate[0], 'r')[entry_key].shape

outfile = "VDS.h5"
outkey = 'data'
f = h5py.File(outfile, 'w', libver='latest')

# target: one extra leading axis, one slot per source file
# (VirtualTarget/VirtualMap only existed in that experimental branch)
TGT = h5py.VirtualTarget(outfile, outkey, shape=(num_projections,) + sh)

VMlist = []
for i in range(num_projections):
    VSRC = h5py.VirtualSource(file_names_to_concatenate[i], entry_key, shape=sh)
    VM = h5py.VirtualMap(VSRC[:, :, :], TGT[i:(i + 1):1, :, :, :], dtype=np.float64)
    VMlist.append(VM)

d = f.create_virtual_dataset(VMlist=VMlist, fillvalue=0)
f.close()