I am trying to read data from an HDF5 file that I previously wrote using a recarray. Each row has the following layout: 2x u2 (flags), followed by 2x u4 (timestamps) and 32x u2 (data).
self.flags = np.empty((self.size, 2), dtype="u2")
self.t0 = np.empty(self.size, dtype="u4")
self.t1 = np.empty(self.size, dtype="u4")
self.data = np.empty((self.size, 32), dtype="u2")
...
labels = ['lost events','overwritten events', 't0', 't1'] + ["data_{0}".format(i) for i in range(32)]
result_arr = np.rec.fromarrays(tuple(self.flags.T)+(self.t0, self.t1) + tuple(self.data.T), names=labels)
file.create_dataset('dataset_name', data=result_arr)
Now I would like to iterate over part of this file (the data part - last 32 columns) row by row and be able to process it as I would a usual numpy.array.
data = self.dataset[row_n]
def parseDataToFlags(data):
    return np.array(list(data)[4:36], dtype="u2")
This works but is extremely slow. I am looking for a proper way to do this, as I will be dealing with large data files. I also tried the following (self.dataset is an h5py dataset loaded from the file):
def get(self, index):
    if not (0 <= index < self.n_of_rows):
        raise IndexError
    return type(self.dataset['t0', 't1'][index])
but it fails when I try to put ["data_{0}".format(i) for i in range(32)] in place of 't0', 't1'.
I made several attempts to parse the data into a structured array, but had no luck so far.
How should I attempt the reading process properly? Should I change the access order (columns before rows) or is there a way to parse these data to a proper type after I read the row?
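For the second part of the question (converting to a plain array after reading), here is a minimal numpy-only sketch. It assumes the rows are already in memory as a structured array with the field names used above, and that numpy 1.16 or later is available for structured_to_unstructured. The row count and the fill values are made up for illustration.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

n_rows = 5  # hypothetical row count, for illustration only

# Same field layout as in the question: 2x u2, 2x u4, 32x u2.
data_names = ["data_{0}".format(i) for i in range(32)]
labels = ['lost events', 'overwritten events', 't0', 't1'] + data_names
formats = ['u2', 'u2', 'u4', 'u4'] + ['u2'] * 32
row_dtype = np.dtype({'names': labels, 'formats': formats})

# Fill a structured array with recognisable dummy values.
records = np.zeros(n_rows, dtype=row_dtype)
for i, name in enumerate(data_names):
    records[name] = np.arange(n_rows) * 100 + i

# Select the 32 data fields and convert them to a plain (n_rows, 32) array.
data_block = rfn.structured_to_unstructured(records[data_names])

print(data_block.shape, data_block.dtype)
```

Each row of data_block can then be processed like any ordinary numpy array, without building a Python list per row.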
UPDATE: I got some help, and here is how it ended up. The slow part of my code was not building a list and converting it to a numpy array for every row; it was the access to the data in the h5py file. So it is better to read the data once and parse it all at once.
self.flags = np.vstack((self.dataset['lost events'], self.dataset['overwritten events'])).T
self.time = np.vstack((self.dataset['t0'], self.dataset['t1'])).T
self.output = np.vstack([self.dataset['data_'+str(i)] for i in range(32)]).T
Once I used this, the code sped up almost 1000 times.
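A self-contained sketch of the same idea, for anyone who wants to try it: the dataset is read from the file a single time and everything else happens in memory. This is a variation on the lines above, not the exact code I used. The file name, the row count and the dummy values are assumptions, and the file is created with the h5py "core" driver so that nothing is written to disk.

```python
import numpy as np
import h5py

n_rows = 1000  # hypothetical size, for illustration only

data_names = ["data_{0}".format(i) for i in range(32)]
labels = ['lost events', 'overwritten events', 't0', 't1'] + data_names
formats = ['u2', 'u2', 'u4', 'u4'] + ['u2'] * 32

# Dummy records with the same layout as in the question.
records = np.zeros(n_rows, dtype=np.dtype({'names': labels, 'formats': formats}))
records['t0'] = np.arange(n_rows)
for i, name in enumerate(data_names):
    records[name] = i

# In-memory HDF5 file; "sketch.h5" is a placeholder name and is never written.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("dataset_name", data=records)
    in_memory = dset[()]  # one read of the whole dataset into a numpy array

# From here on only numpy is involved, no further file access.
flags = np.column_stack((in_memory['lost events'], in_memory['overwritten events']))
time = np.column_stack((in_memory['t0'], in_memory['t1']))
output = np.column_stack([in_memory[name] for name in data_names])

print(flags.shape, time.shape, output.shape)
```

This only makes sense when the whole dataset fits in memory; for larger files the same pattern could be applied to blocks of rows.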
h5py File() (like self in your code above). With h5py, data sets are accessed with the h5file object and dataset name: self['dset_name']. From there, you can use numpy slicing notation to get the rows and columns of interest, like this: self['dset_name'][4:36,'col_i_name','col_j_name','col_k_name'] – kcw78
labels = [ ], for example: 't0', 'data_0', 'data_1', etc. Look in the answer below for an expanded version of hpaulj's example. It adds more fields and data, then shows how to slice by row# or field name – kcw78
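The slicing described in these comments can be sketched as follows. This is only an illustration under assumptions: the dataset name 'dset_name', the reduced set of fields and the dummy values are hypothetical, and the file lives in memory through the h5py "core" driver. Note that the field names are passed as separate string arguments (or as part of a tuple), not as a list.

```python
import numpy as np
import h5py

# Reduced, hypothetical layout: two timestamps and three data fields.
row_dtype = np.dtype([('t0', 'u4'), ('t1', 'u4'),
                      ('data_0', 'u2'), ('data_1', 'u2'), ('data_2', 'u2')])
records = np.zeros(100, dtype=row_dtype)
records['t0'] = np.arange(100)
records['data_1'] = 1

with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("dset_name", data=records)

    # Rows 4..35 of several fields: the result is a structured array.
    subset = dset[4:36, 't0', 'data_0', 'data_1']

    # Rows 4..35 of a single field: the result is a plain array.
    one_col = dset[4:36, 't0']

    # Field names built in a loop are appended to the index as a tuple.
    data_names = ["data_{0}".format(i) for i in range(3)]
    data_rows = dset[(slice(4, 36),) + tuple(data_names)]

print(subset.dtype.names, one_col.dtype, data_rows.dtype.names)
```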