
I am trying to read data from an HDF5 file that I previously saved using a recarray. A row of data has the following layout: 2x u2 (flags), followed by 2x u4 (timestamps) and 32x u2 (data).

self.flags = np.empty((self.size, 2), dtype="u2")
self.t0 = np.empty(self.size, dtype="u4")
self.t1 = np.empty(self.size, dtype="u4")
self.data = np.empty((self.size, 32), dtype="u2")
...
labels = ['lost events','overwritten events', 't0', 't1'] + ["data_{0}".format(i) for i in range(32)]
result_arr = np.rec.fromarrays(tuple(self.flags.T)+(self.t0, self.t1) + tuple(self.data.T), names=labels)
file.create_dataset('dataset_name', data=result_arr)

Now I would like to iterate over part of this file (the data part, i.e. the last 32 columns) row by row and process it as I would an ordinary numpy.array.

data = self.dataset[row_n]
def parseDataToFlags(data):
    return np.array(list(data)[4:36], dtype="u2")

This works, but it is extremely slow. I am looking for the proper way to do this, as I will be dealing with big data files. I also tried to mess with this (self.dataset is an h5py dataset loaded from the file):

    def get(self, index):
        if not (0 <= index < self.n_of_rows):
            raise IndexError
        return type(self.dataset['t0', 't1'][index])

but it fails when I try to put ["data_{0}".format(i) for i in range(32)] in place of 't0', 't1'.

I made several attempts to parse the data into a structured array, but no luck so far.

How should I approach the reading properly? Should I change the access order (columns before rows), or is there a way to parse the data into a proper type after I read the row?

UPDATE: I got some help and here is how it ended up. What was slow in my code wasn't the creation of a list and parsing it to a numpy array for every row; the per-row access to data in the h5py file was. So it is better to access the file once and parse everything in one go:

self.flags = np.vstack((self.dataset['lost events'], self.dataset['overwritten events'])).T
self.time = np.vstack((self.dataset['t0'], self.dataset['t1'])).T
self.output = np.vstack([self.dataset['data_'+str(i)] for i in range(32)]).T

Once I did that, the code sped up by a factor of almost 1000.
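To illustrate the difference, here is a minimal, self-contained sketch of the two access patterns (the file name 'example.h5' and the row count are made up for the example): indexing the h5py dataset row by row triggers one file read per row, while indexing it by field name reads a whole column in a single call.

```python
import numpy as np
import h5py

# Build a small file with the same field layout as in the question
# (hypothetical file name and size, chosen just for this sketch).
labels = (['lost events', 'overwritten events', 't0', 't1']
          + ['data_{0}'.format(i) for i in range(32)])
dt = np.dtype({'names': labels,
               'formats': ['u2'] * 2 + ['u4'] * 2 + ['u2'] * 32})
rec = np.zeros(1000, dt)

with h5py.File('example.h5', 'w') as f:
    f.create_dataset('dataset_name', data=rec)

with h5py.File('example.h5', 'r') as f:
    dset = f['dataset_name']

    # Slow pattern: one file access per row (shown for 10 rows only).
    rows = [np.array(list(dset[i])[4:36], dtype='u2') for i in range(10)]

    # Fast pattern: one file access per field, then stack in memory.
    output = np.vstack([dset['data_' + str(i)] for i in range(32)]).T

print(output.shape)   # one (n_rows, 32) array, same result as row-by-row parsing
```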

Earwin, please clarify "self.dataset is a h5py dataset". You get a file object/handle from h5py.File() (like self in your code above). With h5py, datasets are accessed through the file object and the dataset name: self['dset_name']. From there, you can use numpy slicing notation to get the rows and columns of interest, like this: self['dset_name'][4:36, 'col_i_name', 'col_j_name', 'col_k_name'] – kcw78
Consider modifying your topic tags to add [hdf5] and [h5py] to improve the visibility of your question. – kcw78
@kcw78 self.dataset is initialized via self.dataset = h5pyfile['dataset']. I am not familiar with this kind of slicing: [4:36, 'col_i_name', 'col_j_name', 'col_k_name']. What do the col_*_name entries stand for? Do they give names to the columns I will get with 4:36, or am I getting it wrong? – earvin
I was referring to the different field names you created with labels = [ ], for example: 't0', 'data_0', 'data_1', etc. Look in the answer below for an expanded version of hpaulj's example. It adds more fields and data, then shows how to slice by row number or field name. – kcw78

2 Answers


I think your task would be simpler with a structured array using a compound dtype like:

In [86]: dt = [('events','u2'),('t0','u4'),('t1','u4'),('data','u2',32)]                                     
In [87]: d = np.zeros(3, dt)                                                                                 
In [88]: d                                                                                                   
Out[88]: 
array([(0, 0, 0, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
       (0, 0, 0, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
       (0, 0, 0, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])],
      dtype=[('events', '<u2'), ('t0', '<u4'), ('t1', '<u4'), ('data', '<u2', (32,))])

The data can be accessed as one 2d array:

In [89]: d['data']                                                                                           
Out[89]: 
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint16)
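With a compound dtype like this, the whole record can be written to an HDF5 file in one shot, and the 'data' field comes back as a 2-D array in a single read. A short sketch (the file name 'compound_example.h5' and dataset name 'ds' are made up for the example):

```python
import numpy as np
import h5py

# Same compound dtype as above, with a 32-element subarray field.
dt = [('events', 'u2'), ('t0', 'u4'), ('t1', 'u4'), ('data', 'u2', 32)]
d = np.zeros(3, dt)
d['data'] = np.arange(3 * 32).reshape(3, 32)  # fill with sample values

with h5py.File('compound_example.h5', 'w') as f:
    f.create_dataset('ds', data=d)

with h5py.File('compound_example.h5', 'r') as f:
    data = f['ds']['data']    # one read: all rows, shape (3, 32), dtype uint16
    row = f['ds'][1]['data']  # a single row's 32-element data block

print(data.shape, row[0])
```

No per-row loop or vstack over 32 named fields is needed; the array field keeps the 32 data values together as one block.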

This answer expands on hpaulj's example. It adds more fields and data, creates the HDF5 file, then opens it to read and slice by row number and/or field name. I used h5fw and h5fr to show that I'm writing to and reading from different file handles/objects; normally I would not do this.

Notice how print(arr#.dtype) is specific to the sliced data (arr1.dtype is different from arr2.dtype).

import numpy as np
import h5py

# create HDF5 file and add a dataset:
with h5py.File('SO_57460643.h5','w') as h5fw:

    labels = ['lost events','overwritten events', 't0', 't1'] + ["data_{0}".format(i) for i in range(4)]
    dt = np.dtype({'names': labels,
                   'formats': ['u2']*2 + ['u4']*2 + ['u4']*4 })
    nrows=5
    ncols=8                                               
    d = np.zeros(nrows, dt)
    for row in range(nrows) :
        arr_tup = tuple(range(row*ncols,(row+1)*ncols))
        d[row] = arr_tup
    #print (d)
    h5fw.create_dataset('ds_name', data=d)

# open HDF5 for reading only:
with h5py.File('SO_57460643.h5','r') as h5fr:

#  get last row
    arr1 = h5fr['ds_name'][-1]
    print (arr1.dtype)
    print (arr1)

#  get rows 1-3, fields t0, data_0, data_1, data_2, data_3
    arr2 = h5fr['ds_name'][1:4,'t0', 'data_0', 'data_1', 'data_2', 'data_3']
    print (arr2.dtype)
    print (arr2)