0 votes

I have an HDF5 output file from NASTRAN that contains mode shape data. I am trying to read it into MATLAB and Python to check various post-processing techniques. The file is in the local directory for both of these tests. It is moderately large at 1.2 GB, but certainly not large by the standards of HDF5 files I have read previously. The table I want to access has 17,567,342 rows and 8 columns; the first and last columns are integers, and the middle six are floating-point numbers.

MATLAB:

file = 'HDF5.h5';
hinfo = hdf5info(file);
% ... Find the dataset I want to extract
t = hdf5read(file, '/NASTRAN/RESULT/NODAL/EIGENVECTOR');

This last operation is extremely slow (can be measured in hours).

Python:

import tables
hfile = tables.open_file("HDF5.h5")
modetable = hfile.root.NASTRAN.RESULT.NODAL.EIGENVECTOR
data = modetable.read()

This last operation is basically instant, and I can then access data as if it were a NumPy array. I am clearly missing something very basic about what these commands are doing. I suspect it has something to do with data conversion, but I'm not sure. If I call type(data) I get back numpy.ndarray, and type(data[0]) returns numpy.void.

What is the correct (i.e. speedy) way to read the dataset I want into MATLAB?

Comment from hpaulj: More important than the type is the dtype and shape of data.

2 Answers

1 vote

Matt, are you still working on this problem? I am not a MATLAB guy, but I am familiar with the NASTRAN HDF5 file. You are right; 1.2 GB is big, but not that big by today's standards.
You might be able to diagnose the MATLAB performance bottleneck by running tests with different numbers of rows in your EIGENVECTOR dataset. To do that (without running a lot of NASTRAN jobs), I created some simple code to create an HDF5 file with a user-defined number of rows. It mimics the structure of the NASTRAN eigenvector result dataset. See below:

import tables as tb
import numpy as np

# Create a test file whose layout mimics the NASTRAN eigenvector table
hfile = tb.open_file('SO_54300107.h5', 'w')

# Same 8 columns as the real dataset: integer ID, six floats, integer DOMAIN_ID
eigen_dtype = np.dtype([('ID', int), ('X', float), ('Y', float), ('Z', float),
                        ('RX', float), ('RY', float), ('RZ', float),
                        ('DOMAIN_ID', int)])

fsize = 1000.0   # number of rows; vary this to scale the test
isize = int(fsize)
recarr = np.recarray((isize,), dtype=eigen_dtype)

# Simple synthetic values for each field
id_arr = np.arange(1, isize + 1)
dom_arr = np.ones((isize,), dtype=int)
arr = np.arange(fsize) / fsize

recarr['ID'] = id_arr
recarr['X'] = arr
recarr['Y'] = arr
recarr['Z'] = arr
recarr['RX'] = arr
recarr['RY'] = arr
recarr['RZ'] = arr
recarr['DOMAIN_ID'] = dom_arr

# Write the record array as a table at the same path NASTRAN uses
modetable = hfile.create_table('/NASTRAN/RESULT/NODAL', 'EIGENVECTOR',
                               createparents=True, obj=recarr)

hfile.close()

Try running this with different values of fsize (number of rows), then open the HDF5 file it creates in MATLAB. Maybe you can find the point where performance noticeably degrades; a timing sketch follows.
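On the MATLAB side, something like this (an untested sketch, since I am not a MATLAB guy) should show the difference between the two readers on the generated file, using standard tic/toc timing:

file = 'SO_54300107.h5';
dset = '/NASTRAN/RESULT/NODAL/EIGENVECTOR';

tic;
t_new = h5read(file, dset);    % current high-level reader
fprintf('h5read:   %.2f s\n', toc);

tic;
t_old = hdf5read(file, dset);  % legacy reader suspected of the slowdown
fprintf('hdf5read: %.2f s\n', toc);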

0 votes

MATLAB provides another HDF5 reader called h5read. Using the same basic approach, the time taken to read the data was drastically reduced. In fact, hdf5read is listed for removal in a future version. Here is the same basic code with the preferred functions:

file = 'HDF5.h5';
hinfo = h5info(file);
% ... Find the dataset I want to extract
t = h5read(file, '/NASTRAN/RESULT/NODAL/EIGENVECTOR');
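If only part of the table is needed, h5read also accepts start and count arguments, so a slice of rows can be read without loading the whole 17-million-row dataset. A quick sketch (the row count here is arbitrary):

file = 'HDF5.h5';
dset = '/NASTRAN/RESULT/NODAL/EIGENVECTOR';
% Read only the first 100000 rows (start index, element count)
subset = h5read(file, dset, 1, 100000);

Note that for a compound dataset like this one, h5read returns a struct with one field per column.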