I'm writing a machine learning algorithm on huge, sparse data: my matrix has shape (347, 5416812801) but is very sparse, with only 0.13% of the entries non-zero.
My sparse matrix takes about 105,000 bytes (< 1 MB) and is in `csr` format.
I'm trying to separate train/test sets by choosing a list of example indices for each, so I want to split my dataset in two using:

```python
training_set = matrix[train_indices]  # shape (len(train_indices), 5416812801), still sparse
testing_set = matrix[test_indices]    # shape (347 - len(train_indices), 5416812801), also sparse
```

with `train_indices` and `test_indices` two lists of ints.
But `training_set = matrix[train_indices]` fails with `Segmentation fault (core dumped)`. It might not be a memory problem, as I'm running this code on a server with 64 GB of RAM. Any clue about what could be the cause?
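For reference, here is a scaled-down sketch of what I'm attempting; the toy shape, density, and seed are placeholders, not my real data:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Toy stand-in for the real (347, 5416812801) csr matrix
matrix = sparse_random(347, 10_000, density=0.0013, format='csr')

# Pick disjoint row indices for train and test
rng = np.random.default_rng(0)
train_indices = rng.choice(347, size=300, replace=False)
test_indices = np.setdiff1d(np.arange(347), train_indices)

training_set = matrix[train_indices]  # csr, shape (300, 10000)
testing_set = matrix[test_indices]    # csr, shape (47, 10000)
```

At this small scale the same row indexing works, so the failure seems tied to the real matrix's size.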
Look at `matrix.__getitem__` (the indexing method) to see how it does the selection. Each sparse format does its own indexing. `lil` and `csr` should handle row indexing well; `coo` doesn't handle indexing at all. Indexing sparse matrices isn't hidden in compiled code like it is for arrays (and it isn't as fast). - hpaulj
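A small sketch illustrating the per-format behavior described above (toy shape; a recent scipy is assumed):

```python
from scipy.sparse import random as sparse_random

m = sparse_random(5, 8, density=0.3, format='coo')

# csr and lil both implement fancy row indexing
print(m.tocsr()[[0, 2]].shape)  # (2, 8)
print(m.tolil()[[0, 2]].shape)  # (2, 8)

# coo defines no __getitem__, so indexing it fails outright
try:
    m[[0, 2]]
except TypeError as e:
    print(e)  # 'coo_matrix' object is not subscriptable
```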
It's `csr` and I'm trying to fetch rows, so it should be fine. - Doob

What does `import scipy; print(scipy.__version__)` show? - Warren Weckesser
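The version check presumably matters because the question's column count overflows a 32-bit signed integer, so the sparse matrix needs 64-bit index support; that reading of the comment is an assumption, but the arithmetic is easy to verify:

```python
import numpy as np
import scipy

print(scipy.__version__)

# 5416812801 columns does not fit in a 32-bit signed integer, so
# 32-bit sparse index dtypes cannot address every column (assumption:
# this is why the scipy version was requested).
print(np.iinfo(np.int32).max)                  # 2147483647
print(5_416_812_801 > np.iinfo(np.int32).max)  # True
```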