Suppose I have a really big matrix of sparse data, but I'm only interested in looking at a sample of it, which makes it even more sparse. Suppose I also have a data frame of triples with columns for the row/column/value of the data (imported from a CSV file). I know I can use the sparseMatrix() function of library(Matrix) to create a sparse matrix using
sparseMatrix(i=df$row,j=df$column,x=df$value)
However, because of my index values I end up with a sparse matrix that's millions of rows by tens of thousands of columns (most of which are empty, because my subset excludes most of the rows and columns). All of those zero rows and columns end up skewing some of my functions (take clustering, for example: I end up with one cluster that includes the origin when the origin isn't even a valid point). I'd like to perform the same operation, but using i and j as rownames and colnames. I've tried creating a dense matrix, sampling down to the max size and adding values using
denseMatrix <- matrix(0,nrows,ncols,dimnames=list(unique(df$row),unique(df$column)))
denseMatrix[as.character(df$row),as.character(df$column)]=df$value
(actually I've been setting it equal to 1, because I'm not interested in the value in this case), but I've found that it fills in the entire matrix, because it takes the cross product of all the rows and columns rather than just the pairs (row1, col1), (row2, col2), ... Does anybody know a way to accomplish what I'm trying to do? Alternatively, I'd be fine with filling in a sparse matrix and having it somehow discard all of the zero rows and columns to compact itself into a denser form (but I'd like to maintain some reference back to the original row and column numbers). I appreciate any suggestions!
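For reference, the compaction described above might be sketched roughly as follows. This is only a sketch under assumptions: the data frame `df` here is a small hypothetical stand-in for the imported CSV, and the idea of re-indexing through factors is one possible approach, not necessarily the best one for very large data.

```r
library(Matrix)

# Hypothetical triples standing in for the data frame read from the CSV
df <- data.frame(row = c(3, 1, 3, 5), column = c(2, 4, 6, 6), value = 1)

# Factor levels are the sorted distinct row/column ids that actually occur
rf <- factor(df$row)
cf <- factor(df$column)

# The integer codes of the factors give compact 1..k indices; the levels are
# kept as dimnames so each compact index maps back to its original id
compact <- sparseMatrix(i = as.integer(rf), j = as.integer(cf), x = df$value,
                        dimnames = list(levels(rf), levels(cf)))
```

With this, `compact` only has as many rows and columns as there are distinct ids, and `rownames(compact)` / `colnames(compact)` hold the original row and column numbers as character strings.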
Here's an example:
> rows<-c(3,1,3,5)
> cols<-c(2,4,6,6)
> mtx<-sparseMatrix(i=rows,j=cols,x=1)
> mtx
5 x 6 sparse Matrix of class "dgCMatrix"
[1,] . . . 1 . .
[2,] . . . . . .
[3,] . 1 . . . 1
[4,] . . . . . .
[5,] . . . . . 1
I'd like to get rid of columns 1, 3 and 5 as well as rows 2 and 4. This is a pretty trivial example, but imagine if instead of having row numbers 1, 3 and 5 they were 1000, 3000 and 5000. Then there would be a lot more empty rows between them. Here's what happens when I use a dense matrix with named rows/columns:
> dmtx<-matrix(0,3,3,dimnames=list(c(1,3,5),c(2,4,6)))
> dmtx
  2 4 6
1 0 0 0
3 0 0 0
5 0 0 0
> dmtx[as.character(rows),as.character(cols)]=1
> dmtx
  2 4 6
1 1 1 1
3 1 1 1
5 1 1 1
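The cross-product behaviour above comes from passing two separate index vectors. As a point of comparison, R also accepts a single two-column matrix as the index, in which case each row of that matrix picks out exactly one cell. A minimal sketch using the same `rows`, `cols` and `dmtx` as in the example (nothing else is assumed):

```r
rows <- c(3, 1, 3, 5)
cols <- c(2, 4, 6, 6)
dmtx <- matrix(0, 3, 3, dimnames = list(c(1, 3, 5), c(2, 4, 6)))

# A two-column character matrix used as one index is matched against the
# dimnames and selects one cell per index row, not the cross product
idx <- cbind(as.character(rows), as.character(cols))
dmtx[idx] <- 1
```

Only the four cells named by the (row, column) pairs are set to 1 here; the other five stay 0. This still allocates the full dense matrix, so it only helps when the number of distinct rows and columns is small enough to fit in memory.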