
What's a good Python library for manipulating very large matrices (e.g. millions of rows/columns), including the ability to add rows or columns at any stage of the matrix's life?

I have looked at pytables and h5py, but neither supports adding or removing rows or columns once the matrix is created.

The only other option I could find is the sparse matrix functionality in numpy/scipy noted in these questions. However, the ability to add/remove rows and columns seems possible but officially unsupported and a bit hacky, so I fear the performance would be horrible with a real dataset. Also, scipy includes several different sparse matrix implementations, so I'm confused about which one would be best (e.g. lil_matrix vs csc_matrix vs csr_matrix).


1 Answer


If your matrix is sparse, you can add or remove rows and columns without hacks using scipy.sparse. If you want to remove columns (i.e. do column slicing) you should use csc_matrix, while csr_matrix should be used for efficient row slicing. It is usually convenient to build the matrix first as a coo_matrix, where you specify the row, col and data of each non-zero entry:

from scipy.sparse import coo_matrix

m = coo_matrix((data, (row, col)), shape=(nrow, ncol))
m = m.tocsr()[rows_to_keep, :]   # CSR: efficient row slicing
m = m.tocsc()[:, cols_to_keep]   # CSC: efficient column slicing

where rows_to_keep and cols_to_keep can each be a list or a 1-D array with the indices to keep.
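
For illustration, here is a small self-contained example (the row, col and data arrays are made up) showing the round trip from COO to CSR with row slicing:

import numpy as np
from scipy.sparse import coo_matrix

# hypothetical 3x3 matrix with three non-zero entries
row = np.array([0, 1, 2])
col = np.array([0, 2, 1])
data = np.array([1.0, 2.0, 3.0])

m = coo_matrix((data, (row, col)), shape=(3, 3))
m = m.tocsr()[[0, 2], :]   # keep only rows 0 and 2
print(m.toarray())         # dense view of the two remaining rows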

If you need a dense matrix, you can perhaps use a numpy.memmap array, which keeps the data in a file on disk. To create one you can do:

import numpy as np

a = np.memmap('test.memmap', dtype='float64', mode='w+', shape=(1000, 1000))
a.fill(100.)
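
Changes are written back through the memory map; if you want to force pending changes to disk explicitly, numpy.memmap provides flush():

a.flush()  # write any in-memory changes back to the file on disk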

To read it back later you can do:

a = np.memmap('test.memmap', dtype='float64', mode='r+', shape=(1000, 1000))

If you want to remove or add rows and columns, you have to create a second memmap array and then assign into it the rows or columns you want from the original one:

b = np.memmap('b.memmap', dtype='float64', mode='w+', shape=(3, 1000))
b[:] = a[[0, 99, 199], :]   # copy the selected rows into the new memmap

This saves into b the 1st, 100th and 200th rows of a, with all their columns. (Note that plain b = a[[0, 99, 199], :] would rebind b to an ordinary in-memory array instead of writing into the memmap, which is why the b[:] = ... form is used.)
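
Adding rows works the same way in reverse. A minimal sketch under the same setup (the file name 'c.memmap' and the new_rows array are made up) that grows a by appending rows:

new_rows = np.zeros((10, 1000))   # hypothetical rows to append

c = np.memmap('c.memmap', dtype='float64', mode='w+', shape=(1010, 1000))
c[:1000, :] = a[:]        # copy the original 1000 rows
c[1000:, :] = new_rows    # append the new ones
c.flush()

Appending columns is analogous: allocate a memmap with a wider shape and assign the old data into the leading columns.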