I have a matrix of factors in R and want to convert it to a matrix of 0-1 dummy variables covering all possible levels of each factor.

However, this "dummy" matrix is very large (91690x16593) and very sparse. I need to store it as a sparse matrix, otherwise it does not fit in my 12 GB of RAM.

Currently, I am using the following code, which works fine and takes only seconds:

library(Matrix)
X_factors <- data.frame(lapply(my_matrix, as.factor))
#encode factor data in a sparse matrix
X <- sparse.model.matrix(~.-1, data = X_factors)

However, I want to use the e1071 package in R, and eventually save this matrix to libsvm format with write.matrix.csr(), so first I need to convert my sparse matrix to the SparseM format.

I tried to do:

library(SparseM)  
X2 <- as.matrix.csr(X)

but it very quickly fills my RAM and eventually R crashes. I suspect that internally, as.matrix.csr first converts the sparse matrix to a dense matrix that does not fit in my computer memory.
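
As a rough sanity check on that suspicion, here is a back-of-the-envelope estimate of what a dense copy would need, assuming 8 bytes per double-precision entry and ignoring any temporary copies (the variable names are only illustrative):

```r
# Dimensions of the dummy matrix from the question
n_rows <- 91690
n_cols <- 16593

# A dense numeric matrix stores every entry as an 8-byte double
dense_bytes <- n_rows * n_cols * 8
dense_gb <- dense_bytes / 1e9

# Size of the dense matrix in GB, rounded to 2 decimals
print(round(dense_gb, 2))
```

This comes out at a little over 12 GB for a single dense copy, so any intermediate dense conversion would indeed exhaust 12 GB of RAM.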

My other alternative would be to create my sparse matrix directly in the SparseM format.
I tried as.matrix.csr(X_factors) but it does not accept a data-frame of factors.

Is there an equivalent to sparse.model.matrix(~.-1, data = X_factors) in the SparseM package? I searched the documentation but did not find one.

1 Answer

Quite tricky but I think I got it.

Let's start with a sparse matrix from the Matrix package:

library(Matrix)
# row indices, column indices and values of the non-zero entries
i <- c(1,3:8)
j <- c(2,9,6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)

By default, the Matrix package uses a column-oriented compression format, while SparseM supports both column-oriented and row-oriented formats and has functions that easily handle the conversion from one format to the other.
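
To make the column-oriented layout concrete, here is a small sketch that inspects the slots of the example matrix. It rebuilds the same X as above so that it runs on its own; the slot names i, p and x are those of the dgCMatrix class in Matrix.

```r
library(Matrix)

# Same small example matrix as above (8 rows, 10 columns, 7 non-zeros)
i <- c(1, 3:8)
j <- c(2, 9, 6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)

# Row index of each non-zero entry, 0-based, stored column by column
print(X@i)
# Column pointers, 0-based: one entry per column plus one
print(X@p)
# The non-zero values, in the same column-by-column order
print(X@x)
```

Both index slots start at 0, which is why a shift by one is needed before handing them to SparseM.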

So we first convert our column-oriented Matrix object into a column-oriented SparseM matrix. We just need to call the right constructor and account for the two packages' different index conventions (the Matrix slots are 0-based, the SparseM slots are 1-based):

library(SparseM)
# shift the 0-based Matrix indices by one to get SparseM's 1-based indices
X.csc <- new("matrix.csc", ra = X@x,
                           ja = X@i + 1L,
                           ia = X@p + 1L,
                           dimension = X@Dim)

Then, change from column-oriented to row-oriented format:

X.csr <- as.matrix.csr(X.csc)

And you're done! You can check that the two matrices are identical (on my small example) by doing:

range(as.matrix(X) - as.matrix(X.csc))
# [1] 0 0
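
For the matrix in the question, the two steps can be wrapped in a small helper. This is only a sketch: the name dgc_to_csr is made up for illustration, it assumes the input is a dgCMatrix (which is what sparse.model.matrix returns), and the tiny data frame stands in for the real X_factors.

```r
library(Matrix)
library(SparseM)

# Hypothetical helper (the name is an assumption, not part of either package):
# convert a Matrix dgCMatrix to a SparseM matrix.csr without a dense copy.
dgc_to_csr <- function(m) {
  stopifnot(inherits(m, "dgCMatrix"))
  # Column-compressed to column-compressed, shifting 0-based to 1-based indices
  m.csc <- new("matrix.csc",
               ra = m@x,
               ja = m@i + 1L,
               ia = m@p + 1L,
               dimension = m@Dim)
  # Column-oriented to row-oriented, done by SparseM on the sparse data
  as.matrix.csr(m.csc)
}

# Small stand-in for X_factors from the question
X_factors <- data.frame(a = factor(c("u", "v", "u", "w")),
                        b = factor(c("p", "p", "q", "q")))
X <- sparse.model.matrix(~ . - 1, data = X_factors)
X.csr <- dgc_to_csr(X)

# Safe to densify here only because the stand-in is tiny
print(range(as.matrix(X) - as.matrix(X.csr)))

# With e1071 installed, the result could then be written in libsvm format:
# e1071::write.matrix.csr(X.csr, file = "X.libsvm")
```

The dense comparison at the end is only meant for small inputs; on the full 91690x16593 matrix it would hit the same memory limit as before.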