Problem renaming all HDF5 datasets in group for large hdf5 files

Question

I am having a problem renaming datasets in hdf5. The process is EXTREMELY slow. I read some documentation stating that dataset names are merely links to the data, so an acceptable way to rename is:

group['new_name'] = group['old_name']
del group['old_name']

But this is so slow (only 5% complete running overnight), it makes me think my process is entirely wrong.

I'm using python h5py, and here's my slow code:

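import h5py
from tqdm import tqdm

# Open file (mode 'a' = read/write, which was the h5py default at the time)
with h5py.File('test.hdf5', 'a') as f: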

    # Get all top level groups
    top_keys = [key for key in f.keys()]

    # Iterate over each group
    for top_key in top_keys:
        group = f[top_key]
        tot_digits = len(group)

        #Rename all datasets in the group (pad with zeros)
        for key in tqdm(group.keys()):
            new_key = str(key)
            while len(new_key)<tot_digits:
                new_key = '0'+str(new_key)
            group[new_key] = group[key]
            del group[key]

Per @jpp's suggestion, I also tried replacing the last two lines with group.move:

group.move(key, new_key)

But this method was equally slow. I have several groups with the same number of datasets, but the datasets differ in size from group to group. The group with the largest datasets (most bytes) seems to rename the slowest.

Certainly there is a way to do this quickly. Is the dataset name just a symbolic link? Or does renaming inherently cause the entire dataset to be rewritten? How should I go about renaming many datasets in an HDF5 file?

How many datasets do you have in your group? It would be nice if you had some code to create a trivial HDF5 file so we can benchmark against it (and demonstrate your problem at the same time). — jpp
I have just short of 1M datasets in each group, and my hdf5 file is ~20GB, so sharing the dataset is difficult. The key question has more to do with how named datasets behave. Is the name just a symbolic link? Or does renaming inherently cause the entire dataset to be rewritten? @jpp — Richard
Not sure if this is relevant still but I've had issues in the past naming HDF5 groups with names starting with a numerical value, maybe try a different naming scheme if nothing else seems to work. — Joules
@Joules My groups are named with letters, but my datasets are named with numbers. Have you had issues with dataset names as well? Or just groups? — Richard
IIRC it wouldn't let me save a dataset or a group with a name which had a number as its first character. I may have been using the pandas HDF5 though, so I'm not sure if it would react the same way with h5py. — Joules

ilmarinen · Accepted Answer · 2018-11-07T22:42:56

One possible culprit, at least if you have a large number of groups under your top-level keys, is that you are creating the new name in a very inefficient way. Instead of

while len(new_key)<tot_digits:
    new_key = '0'+str(new_key)

You should generate the new key like this:

if len(new_key)<tot_digits:
    new_key = (tot_digits-len(new_key))*'0' + new_key

This way you don't create a new string object for every extra digit you need to add.
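As a quick check, the sketch below compares the two padding approaches with Python's built-in str.zfill and str.rjust, which give the same result for names made only of digits. The helper names pad_loop and pad_once and the sample keys are made up for illustration. The loop version copies the growing string once per added character, so its cost grows quadratically with the amount of padding, while the other versions build the result in one step.

```python
def pad_loop(key, width):
    """Original approach: prepend one '0' per iteration."""
    new_key = str(key)
    while len(new_key) < width:
        new_key = '0' + new_key
    return new_key


def pad_once(key, width):
    """Approach from this answer: build the padding in a single step."""
    new_key = str(key)
    if len(new_key) < width:
        new_key = (width - len(new_key)) * '0' + new_key
    return new_key


# Hypothetical sample keys; all four approaches agree for digit-only names.
for key in ['7', '42', '123456']:
    assert pad_loop(key, 6) == pad_once(key, 6)
    assert pad_once(key, 6) == key.zfill(6) == key.rjust(6, '0')

print(pad_once('42', 6))  # 000042
```

Note that zfill treats a leading '+' or '-' specially, so it only matches the manual padding for names without a sign.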

It is also possible, although I can't confirm this, that group.keys() returns a view that picks up the new key names as you add them, since you modify the group while iterating over its keys. Changing the size of a standard Python dict while iterating over it raises a RuntimeError, but it's unclear whether h5py would do the same. To be sure you don't have that problem, you can simply create a list of the keys up front:

for key in tqdm(list(group.keys())):
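Putting these points together, a minimal sketch of the rename loop might look like the following. The function name pad_dataset_names and its width parameter are hypothetical. It only relies on the group offering len(), keys() and move(), which h5py groups do. One assumption is worth flagging: the question's code sets tot_digits = len(group), which is the number of datasets rather than the number of digits in that count, so with about 1M datasets every name would be padded to about 1M characters. The sketch assumes the intended width is len(str(len(group))).

```python
def pad_dataset_names(group, width=None):
    """Zero-pad every name in `group`; return how many names were changed.

    `group` is expected to behave like an h5py.Group: it must support
    len(), keys() and move(old_name, new_name).
    """
    if width is None:
        # Assumption: pad to the number of digits in the dataset count,
        # e.g. 7 for 1,000,000 datasets.
        width = len(str(len(group)))

    # Snapshot the names first so the loop never sees names it just created.
    old_keys = list(group.keys())

    renamed = 0
    for key in old_keys:
        new_key = key.zfill(width)
        if new_key != key:  # names that are already wide enough are skipped
            group.move(key, new_key)
            renamed += 1
    return renamed


# Usage with h5py (assumes test.hdf5 exists and is writable):
#
#     import h5py
#     with h5py.File('test.hdf5', 'r+') as f:
#         for top_key in list(f.keys()):
#             pad_dataset_names(f[top_key])
```

Whether this is fast enough for a 20GB file with about 1M datasets per group has not been measured here; it only removes the two issues described above.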
