I am having a problem renaming datasets in HDF5: the process is extremely slow. I read some documentation stating that dataset names are merely links to the data, so an acceptable way to rename is:
group['new_name'] = group['old_name']
del group['old_name']
But this is so slow (only 5% complete after running overnight) that it makes me think my approach is entirely wrong.
I'm using Python with h5py, and here's my slow code:
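For reference, the link-based rename idiom can be checked in isolation. The following is a minimal sketch on a throwaway in-memory file; the file name `link_demo.h5`, the group name `g`, and the dataset contents are illustrative assumptions, not taken from the real data. It only shows that the assignment creates a second name for the same dataset and says nothing about how fast this is on a large on-disk file.

```python
import h5py
import numpy as np

# In-memory file (assumption for illustration): driver="core" with
# backing_store=False means nothing is written to disk.
with h5py.File("link_demo.h5", "w", driver="core", backing_store=False) as f:
    group = f.create_group("g")
    group.create_dataset("old_name", data=np.arange(5))

    # Create a second hard link pointing at the same dataset object.
    group["new_name"] = group["old_name"]

    # A write through one name is visible through the other,
    # which indicates both names refer to the same data.
    group["new_name"][0] = 42
    shared = int(group["old_name"][0])

    # Remove the old link; the data stays reachable via the new name.
    del group["old_name"]
    remaining = sorted(group.keys())

print(shared, remaining)
```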
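import h5py
from tqdm import tqdm

# Open the file in read/write mode
with h5py.File('test.hdf5', 'r+') as f:
    # Get all top-level groups
    top_keys = list(f.keys())
    # Iterate over each group
    for top_key in top_keys:
        group = f[top_key]
        # Pad width: the number of datasets in the group
        tot_digits = len(group)
        # Rename all datasets in the group (pad with zeros);
        # snapshot the keys first, since the loop modifies the group
        for key in tqdm(list(group.keys())):
            new_key = str(key).rjust(tot_digits, '0')
            if new_key == key:
                continue  # already wide enough, nothing to rename
            group[new_key] = group[key]
            del group[key]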
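The padding step can be looked at separately from h5py. Below is a small sketch with a hypothetical helper, `padded_names`, that builds the old-to-new name mapping for a given width. It also contrasts two possible widths: the number of datasets (what the loop above uses) and the number of digits in that count. Which one is intended is an assumption; the sample keys are made up.

```python
def padded_names(keys, width):
    """Map each key to its zero-padded form, skipping keys already wide enough."""
    return {k: k.rjust(width, "0") for k in keys if len(k) < width}


# Hypothetical keys standing in for dataset names "0" .. "11".
keys = [str(i) for i in range(12)]

width_by_count = len(keys)              # number of datasets
width_by_digits = len(str(len(keys)))   # digits in that number

by_count = padded_names(keys, width_by_count)
by_digits = padded_names(keys, width_by_digits)

print(by_count["7"])
print(by_digits["7"])
```

With the count as the width, every name grows to as many characters as there are datasets, so very large groups end up with very long names.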
Per @jpp's suggestion, I also tried replacing the last two lines with group.move:
group.move(key, new_key)
But this method was equally slow. I have several groups with the same number of datasets, but the datasets differ in size from group to group. The group with the largest datasets (most bytes) seems to rename the slowest.
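One way to test whether rename time depends on dataset size is to time `group.move` on a small and a large dataset in a scratch file. This is only a sketch: the file name `move_timing.h5`, the helper `time_move`, and the array sizes are assumptions, and an in-memory file may not behave like a large file on disk.

```python
import time

import h5py
import numpy as np


def time_move(group, old, new):
    """Rename one link with Group.move and return the elapsed seconds."""
    start = time.perf_counter()
    group.move(old, new)
    return time.perf_counter() - start


# In-memory scratch file (assumption): nothing is written to disk.
with h5py.File("move_timing.h5", "w", driver="core", backing_store=False) as f:
    g = f.create_group("g")
    g.create_dataset("small", data=np.zeros(1))
    g.create_dataset("large", data=np.zeros(1_000_000))

    t_small = time_move(g, "small", "small_renamed")
    t_large = time_move(g, "large", "large_renamed")
    names = sorted(g.keys())

print(f"small: {t_small:.6f}s, large: {t_large:.6f}s")
```

Running the same helper against a copy of the real file would show whether the slowdown tracks dataset size or something else, such as the number or length of names in the group.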
Certainly there is a way to do this quickly. Is the dataset name just a symbolic link? Or does renaming inherently cause the entire dataset to be rewritten? How should I go about renaming many datasets in an HDF5 file?