3
votes

I have a large sparse matrix ("dgCMatrix", dimension 5e+5 x 1e+6). I need to count, for each column, how many non-zero values it contains, and then list the names of the columns with exactly 1 non-zero entry.

My code works for small matrices, but becomes too computationally intensive for the actual matrix I need to work on.

library(Matrix)
set.seed(0)
mat <- Matrix(matrix(rbinom(200, 1, 0.10), ncol = 20))
colnames(mat) <- letters[1:20]

entries <- colnames(mat[, nrow(mat) - colSums(mat == 0) == 1])

Any suggestion is very welcome!


2 Answers

2
votes

I have a large sparse matrix ("dgCMatrix")

Let us call it dgCMat.

I need to count for each column how many non-zero values there are

xx <- diff(dgCMat@p)

and make a list of column names with only 1 non-zero entry

colnames(dgCMat)[xx == 1]
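
As a minimal sketch, the two steps above can be run on the toy matrix from the question. In a "dgCMatrix" the slot @p holds, for each column, the offset at which its stored entries begin, so consecutive differences give the number of stored entries per column. The names nnz_per_col and entries are only illustrative, and the sketch assumes the matrix stores no explicit zeros (drop0() would remove them if it did).

```r
library(Matrix)

# Toy matrix from the question, forced to sparse storage
set.seed(0)
mat <- Matrix(matrix(rbinom(200, 1, 0.10), ncol = 20), sparse = TRUE)
colnames(mat) <- letters[1:20]

# Coerce explicitly so that the @p slot is guaranteed to exist
dgCMat <- as(mat, "CsparseMatrix")

# @p has ncol + 1 offsets; their differences are the stored entries per column
nnz_per_col <- diff(dgCMat@p)

# Names of the columns that hold exactly one stored entry
entries <- colnames(dgCMat)[nnz_per_col == 1]
```

No dense intermediate is created here, so the cost depends on the number of columns rather than on the number of cells.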

Summary

nnz: number of non-zeros

For a "dgCMatrix" dgCMat:

## nnz per column
diff(dgCMat@p)

## nnz per row (`nbins` keeps trailing empty rows in the result)
tabulate(dgCMat@i + 1L, nbins = nrow(dgCMat))

For a "dgRMatrix" dgRMat:

## nnz per column (`nbins` keeps trailing empty columns in the result)
tabulate(dgRMat@j + 1L, nbins = ncol(dgRMat))

## nnz per row
diff(dgRMat@p)

For a "dgTMatrix" dgTMat:

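## a "dgTMatrix" may store the same (i, j) pair more than once;
## such duplicates are counted separately here

## nnz per column
tabulate(dgTMat@j + 1L, nbins = ncol(dgTMat))

## nnz per row
tabulate(dgTMat@i + 1L, nbins = nrow(dgTMat))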
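
The slot-based counts for the three classes can be checked against a dense reference on a small example. This is only a sketch: the names dense, col_ref, row_ref and the *_counts lists are illustrative, and the last row is deliberately left empty to show why tabulate() is given nbins here (without it, the result stops at the largest index that actually occurs).

```r
library(Matrix)

set.seed(1)
dense <- matrix(rbinom(60, 1, 0.2), nrow = 6)
dense[6, ] <- 0  # empty last row, so the `nbins` argument matters

# The same data in column-compressed, row-compressed and triplet form
dgCMat <- as(Matrix(dense, sparse = TRUE), "CsparseMatrix")
dgRMat <- as(dgCMat, "RsparseMatrix")
dgTMat <- as(dgCMat, "TsparseMatrix")

# Dense reference counts
col_ref <- unname(colSums(dense != 0))
row_ref <- unname(rowSums(dense != 0))

# Per-column counts from the internal slots
col_counts <- list(
  C = diff(dgCMat@p),
  R = tabulate(dgRMat@j + 1L, nbins = ncol(dgRMat)),
  T = tabulate(dgTMat@j + 1L, nbins = ncol(dgTMat))
)

# Per-row counts from the internal slots
row_counts <- list(
  C = tabulate(dgCMat@i + 1L, nbins = nrow(dgCMat)),
  R = diff(dgRMat@p),
  T = tabulate(dgTMat@i + 1L, nbins = nrow(dgTMat))
)

# TRUE for every class when the slot-based counts match the reference
cols_ok <- vapply(col_counts, function(x) all(x == col_ref), logical(1))
rows_ok <- vapply(row_counts, function(x) all(x == row_ref), logical(1))
```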

I had not read your original code when I posted this answer, so I did not know that you were stuck on the use of mat == 0. Only later did I add the comments on the difference between mat == 0 and mat != 0 to your answer.

Your workaround using mat != 0 makes good use of the package's sparse logical matrices. The same line of code should work with other sparse matrix classes, too. Mine reads the internal storage slots directly, so a different version is needed for each class.

3
votes

The following produces the same result. Please note the comments:

## `mat != 0` returns an "lgCMatrix", which is sparse
## avoid `mat == 0`: its result is dense, because almost every entry is zero
entries <- colnames(mat)[colSums(mat != 0) == 1]
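
A small sketch of how this line can be reused without touching the internal slots. The helper name single_entry_cols is hypothetical, and only the column-compressed and triplet forms are exercised here; the row-compressed form is not tested in this sketch.

```r
library(Matrix)

# Hypothetical helper: names of the columns with exactly one non-zero entry
single_entry_cols <- function(m) {
  colnames(m)[colSums(m != 0) == 1]
}

# Toy matrix from the question, forced to sparse storage
set.seed(0)
mat <- Matrix(matrix(rbinom(200, 1, 0.10), ncol = 20), sparse = TRUE)
colnames(mat) <- letters[1:20]

# The same call on two different storage classes
from_C <- single_entry_cols(as(mat, "CsparseMatrix"))
from_T <- single_entry_cols(as(mat, "TsparseMatrix"))

# Slot-based result for comparison
from_slots <- colnames(mat)[diff(as(mat, "CsparseMatrix")@p) == 1]
```

The helper stays class-agnostic because it only relies on `!=` and colSums(), at the cost of building one intermediate sparse logical matrix.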