5
votes

I'm new to Python, coming from MATLAB. I have a large sparse matrix saved in MATLAB v7.3 (HDF5) format. So far I've found two ways of loading the file: h5py and tables (PyTables). However, operating on the matrix is extremely slow with either one. For example, in MATLAB:

>> whos     
  Name           Size                   Bytes  Class     Attributes

  M      11337x133338            77124408  double    sparse    

>> tic, sum(M(:)); toc
Elapsed time is 0.086233 seconds.

Using tables:

t = time.time()
sum(f.root.M.data)
elapsed = time.time() - t
print elapsed
35.929461956

Using h5py:

t = time.time()
sum(f["M"]["data"])
elapsed = time.time() - t
print elapsed

(I gave up waiting ...)

[EDIT]

Based on the comments from @bpgergo, I should add that I've tried converting the result loaded in by h5py (f) into a numpy array or a scipy sparse array in the following two ways:

from scipy import sparse
A = sparse.csc_matrix((f["M"]["data"], f["M"]["ir"], f["M"]["jc"]))

or

data = numpy.asarray(f["M"]["data"])
ir = numpy.asarray(f["M"]["ir"])
jc = numpy.asarray(f["M"]["jc"])
A = sparse.coo_matrix(data, (ir, jc))

but both of these operations are extremely slow as well.

Is there something I'm missing here?

3 Answers

3
votes

Most of your problem is that you're using python sum on what's effectively a memory-mapped array (i.e. it's on disk, not in memory).

First off, you're comparing the time it takes to read things from disk to the time it takes to read things in memory. Load the array into memory first, if you want to compare to what you're doing in matlab.

Secondly, Python's builtin sum is very inefficient for numpy arrays. (Or, rather, iterating through every item of a numpy array independently is very slow, which is what Python's builtin sum is doing.) Use numpy.sum(yourarray) or yourarray.sum() instead for numpy arrays.

As an example:

(Using h5py, because I'm more familiar with it.)

import h5py
import numpy as np

f = h5py.File('yourfile.hdf', 'r')
dataset = f['/M/data']

# Load the entire array into memory, like you're doing for matlab...
data = np.empty(dataset.shape, dataset.dtype)
dataset.read_direct(data)

print data.sum() #Or alternately, "np.sum(data)"
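
To see the cost of the builtin sum in isolation, with no HDF5 file involved, here is a minimal sketch. The array `values` is a made-up stand-in for the loaded data, and the block is written for Python 3; the timings it prints will vary by machine.

```python
import time

import numpy as np

# Hypothetical stand-in for the values read from the HDF5 file.
values = np.arange(1000000, dtype=np.float64)

# Builtin sum: iterates over the array one element at a time in Python.
start = time.perf_counter()
builtin_total = sum(values)
builtin_seconds = time.perf_counter() - start

# numpy sum: a single reduction over the whole buffer in compiled code.
start = time.perf_counter()
numpy_total = values.sum()
numpy_seconds = time.perf_counter() - start

print("builtin sum: %.4f s" % builtin_seconds)
print("numpy sum:   %.4f s" % numpy_seconds)
```

Both calls return the same total; only the time taken differs.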

2
votes

The final answer for posterity:

import tables, warnings
from scipy import sparse

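def load_sparse_matrix(fname):
    warnings.simplefilter("ignore", UserWarning)
    # PyTables 3.x spells this open_file; releases before 3.0 used openFile.
    f = tables.open_file(fname)
    # [...] reads each dataset fully into memory before the matrix is built.
    M = sparse.csc_matrix((f.root.M.data[...], f.root.M.ir[...], f.root.M.jc[...]))
    f.close()
    return M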
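
For anyone wondering why the three arrays can be handed straight to csc_matrix: MATLAB keeps sparse matrices in compressed-column form, and the function above relies on data, ir and jc being usable as-is as the values, row indices and column pointers. Below is a small sketch with a hand-built 3x3 matrix. The numbers and the explicit `shape` argument are my own illustration, not read from a real file.

```python
import numpy as np
from scipy import sparse

# Illustrative matrix (made up for this sketch):
#   [[1, 0, 2],
#    [0, 0, 3],
#    [4, 0, 0]]
data = np.array([1.0, 4.0, 2.0, 3.0])  # nonzero values, column by column
ir = np.array([0, 2, 0, 1])            # row index of each value
jc = np.array([0, 2, 2, 4])            # column j occupies data[jc[j]:jc[j + 1]]

# Passing shape is an assumption added here; without it scipy infers the
# row count from the largest row index, which drops trailing all-zero rows.
M = sparse.csc_matrix((data, ir, jc), shape=(3, 3))

dense = M.toarray()
total = M.sum()
print(dense)
print(total)
```

Note that jc has one more entry than there are columns, which is why it cannot be used as a list of column indices for a COO matrix.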