6
votes

Suppose I have a really big matrix of sparse data, but i'm only interested in looking at a sample of it making it even more sparse. Suppose I also have a dataframe of triples including columns for row/column/value of the data (imported from a csv file). I know I can use the sparseMatrix() function of library(Matrix) to create a sparse matrix using

sparseMatrix(i=df$row,j=df$column,x=df$value)

However, because of my values I end up with a sparse matrix that's millions of rows by tens of thousands of columns (most of which are empty because my subset is excluding most of the rows and columns). All of those zero rows and columns end up skewing some of my functions (take clustering for example -- I end up with one cluster that includes the origin when the origin isn't even a valid point). I'd like to perform the same operation, but using i and j as rownames and colnames. I've tried creating a dense vector, sampling down to the max size and adding values using

denseMatrix <- matrix(0,nrows,ncols,dimnames=c(df$row,df$column))
denseMatrix[as.character(df$row),as.character(df$column)]=df$value

(actually I've been setting it equal to 1 because I'm not interested in the value in this case) but I've been finding it fills in the entire matrix because it takes the cross of all the rows and columns rather than just row1*col1, row2*col2... Does anybody know a way to accomplish what I'm trying to do? Alternatively i'd be fine with filling in a sparse matrix and simply having it somehow discard all of the zero rows and columns to compact itself into a denser form (but I'd like to maintain some reference back to the original row and column numbers) I appreciate any suggestions!

Here's an example:

> rows<-c(3,1,3,5)
> cols<-c(2,4,6,6)
> mtx<-sparseMatrix(i=rows,j=cols,x=1)
> mtx
5 x 6 sparse Matrix of class "dgCMatrix"

[1,] . . . 1 . .
[2,] . . . . . .
[3,] . 1 . . . 1
[4,] . . . . . .
[5,] . . . . . 1

I'd like to get rid of colums 1,3 and 5 as well as rows 2 and 4. This is a pretty trivial example, but imagine if instead of having row numbers 1, 3 and 5 they were 1000, 3000 and 5000. Then there would be a lot more empty rows between them. Here's what happens when I using a dense matrix with named rows/columns

> dmtx<-matrix(0,3,3,dimnames=list(c(1,3,5),c(2,4,6)))
> dmtx
  2 4 6
1 0 0 0
3 0 0 0
5 0 0 0
> dmtx[as.character(rows),as.character(cols)]=1
> dmtx
  2 4 6
1 1 1 1
3 1 1 1
5 1 1 1
4
Can you show a small example, say 10x10, sparse matrix, plus the triplets you might use in that situation, and what subset you want?Gavin Simpson
Have you investigated the SparseM package?Spacedman
Spacedman, I just skimmed through the package doc and I have to say, it's not an easy read. Are you suggesting there happens to be a method burried somewhere in there that does what I'm looking for? As far as I can tell (and from everything i'd read in the past) it seems as though it's just an alternate implementation of sparse matrices from the Matrix library. If you know of a reason that one api would be better than the other in this case, I'm all ears.dscheffy

4 Answers

4
votes

When you say "get rid of" certain columns/rows, do you mean just this:

> mtx[-c(2,4), -c(1,3,5)]
3 x 3 sparse Matrix of class "dgCMatrix"

[1,] . 1 .
[2,] 1 . 1
[3,] . . 1

Subsetting works, so you just need a way of finding out which rows and columns are empty? If that is correct, then you can use colSums() and rowSums() as these have been enhanced by the Matrix package to have appropriate methods for sparse matrices. This should preserve the sparseness during the operation

> dimnames(mtx) <- list(letters[1:5], LETTERS[1:6])
> mtx[which(rowSums(mtx) != 0), which(colSums(mtx) != 0)]
3 x 3 sparse Matrix of class "dgCMatrix"
  B D F
a . 1 .
c 1 . 1
e . . 1

or, perhaps safer

> mtx[rowSums(mtx) != 0, colSums(mtx) != 0]
3 x 3 sparse Matrix of class "dgCMatrix"
  B D F
a . 1 .
c 1 . 1
e . . 1
4
votes

Your code almost works, you just need to cbind together the row names and column names. Each row of the resulting matrix is then treated as a pair instead of treating the rows and the columns separately.

> dmtx <- matrix(0,3,3,dimnames=list(c(1,3,5),c(2,4,6)))
> dmtx[cbind(as.character(rows),as.character(cols))] <- 1
> dmtx
  2 4 6
1 0 1 0
3 1 0 1
5 0 0 1

This may be faster if you use factors.

> rowF <- factor(rows)
> colF <- factor(cols)
> dmtx <- matrix(0, nlevels(rowF), nlevels(colF), 
                 dimnames=list(levels(rowF), levels(colF)))
> dmtx[cbind(rowF,colF)] <- 1
> dmtx
  2 4 6
1 0 1 0
3 1 0 1
5 0 0 1

You can also use these factors in a call to sparseMatrix.

> sparseMatrix(i=as.integer(rowF), j=as.integer(colF), x=1,
+              dimnames = list(levels(rowF), levels(colF)))
3 x 3 sparse Matrix of class "dgCMatrix"
  2 4 6
1 . 1 .
3 1 . 1
5 . . 1

Note that one of the other solutions may be faster; converting to factors can be slow if there's a lot of data.

1
votes

Your first issue stems from the fact that the coordinate list (COO) has non-contiguous values for the row and column indices. When faced with this, or even when dealing with most sparse matrices, I tend to reorder the rows and columns by their support.

You can do this in two ways:

  1. Produce the sparse matrix and the do colSums and rowSums of logical(yourMatrix) to get the support values, or
  2. Use a function like table or bigtabulate (from the bigmemory suite) to calculate the # of unique times that each value has occurred in the coordinate list. (My preference is bigtabulate.)

Once you have the support, you can use the rank function (actually, rank(-1 * support, ties = "first")) to map the original indices to new ones, based on their ranks.

At this point, if you create the matrix with sparseMatrix, it will only produce a matrix with dimensions such that all of your rows and columns have support. It will not map to anything larger.

This is similar to @GavinSimpson's approach, though his method only drops the missing rows and columns, while my approach reorders to put the maximum density in the upper left corner of the matrix, with decreasing density as you move to larger indices for the rows and columns. In order to map back to the original indices in my approach, simply create a pair of mappings: "original to ranked" and "ranked to original", and you can perfectly recreate the original data, if you choose.

0
votes

@Iterator's answer is very helpful for my application, but it's a pity that his/her response didn't include an example to illustrate the idea. Here is my implementation of the idea for reordering the rows and columns of very huge sparse matrix (e.g. with about one million rows and a few thousands of columns on supercomputer with sufficient memory to load the sparse matrix).

library(Matrix)

sparseY <- sparseMatrix( i=sample(2000, 500, replace=TRUE), j=sample(1000,500, replace=TRUE), x=sample(10000,500) )

# visualize the original sparse matrix
image(sparseY, aspect=1, colorkey=TRUE, main="The original sparse matrix")

numObs <- length( sparseY@x )
# replace all non-zero entries with 1 to calculate #non-zero entries per row/column and use rank() to sort based on supports
logicalY <- sparseY; logicalY@x <- rep(1, numObs)

# calculate the number of observed entries per row/column
colObsFreqs <- colSums(logicalY)
rowObsFreqs <- rowSums(logicalY)

colObsFreqs
rowObsFreqs

# get the rank of supports for rows and columns  
colRanks <- rank( -1*colObsFreqs, ties="first" )
rowRanks <- rank( -1*rowObsFreqs, ties="first" )

# Sort the ranks from small to large
sortColInds <- sort(colRanks, index.return=TRUE)
sortRowInds <- sort(rowRanks, index.return=TRUE)

# reorder the original sparse matrix so that the maximum density data block is placed in the upper left corner of the matrix, with decreasing density as you move to larger indices for the rows and columns. 
sparseY <- sparseY[ sortRowInds$ix, sortColInds$ix ]

# visualize the reordered sparse matrix
image(sparseY, aspect=1, colorkey=TRUE, main="The sparse matrix after reordering")

logicalY <- sparseY; logicalY@x <- rep(1, numObs)
# Check whether the resulting sparse matrix is what's expected, i.e. with the maximum density data block placed in the upper left corner of the matrix
colObsFreqs <- colSums(logicalY)
rowObsFreqs <- rowSums(logicalY)

colObsFreqs
rowObsFreqs