1
votes

I have a large problem, and a more specific problem that I'm hoping will, once solved, solve the larger one. I would really appreciate it if anyone has any ideas for me to try.

Basically I have a huge sparse matrix (about 300k x 150k, originally a Term-Document matrix created with R's {tm} package) that is saved as a simple triplet matrix using the {slam} package and I'm running a function that loops through sets of terms and then subsets it based on those terms. Unfortunately, the subsetting process is prohibitively slow.

In trying to figure out how to subset more quickly, I stumbled on the data.table package, which performed very well in some tests I ran with it. However, when I try to convert my sparse matrix into a data.table, I get

Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

I understand that this is because it tries to convert it to a standard matrix first, which is technically a vector for R, and 300k*150k is well above the .Machine$integer.max.
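To make the arithmetic concrete, here is a small sketch of the overflow (the variable names are only illustrative). Multiplying the two dimensions as integers yields NA, while the same product in double precision shows how far past the integer limit the dense matrix would be:

```r
nr <- 300000L
nc <- 150000L

# Integer multiplication overflows and returns NA (normally with a warning).
overflowed <- suppressWarnings(nr * nc)

# The same product as doubles: the number of cells a dense matrix would need.
cells <- as.numeric(nr) * as.numeric(nc)

print(overflowed)                       # NA
print(cells > .Machine$integer.max)     # TRUE
```

Even if the length did fit, a dense double matrix of that size would need roughly 8 bytes per cell, so avoiding the dense conversion entirely is the realistic option.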

So my question: does anyone know how to convert a simple triplet matrix into a data.frame or data.table without converting it to a matrix first, thereby avoiding the integer overflow?

If not, does anyone (a) have another workaround or (b) have any advice on quickly subsetting huge sparse matrices and/or simple triplet matrices?

Below is a reproducible example to mess around with. On my machine, the loop, which subsets each of the first 10 columns, takes about 3 seconds. Once we get into looping over hundreds of thousands of them, that gets prohibitive quickly. Thanks in advance for the help:

require(slam)
STM <- simple_triplet_matrix(i = as.integer(runif(10000000,1,300000)), 
                  j = as.integer(runif(10000000,1,150000)),
                  v = rep(rnorm(10), 1000000),
                  nrow = 300000,
                  ncol = 150000)

start <- Sys.time()
for (i in 1:10) {
  vec <- as.matrix(STM[,i])
}
Sys.time() - start

Sidenote: notice that if you try STMm <- as.matrix(STM) you get the same overflow error I showed above.

2
A triplet matrix is essentially 3 vectors: the coordinates of the non-zero elements in the first 2, and the values in the 3rd. A possible approach (I didn't test it) would be to use the Rcpp package to write your subsetting algorithm in C++, passing the triplet to C++ as a list of 3 vectors. It should be as fast as anything you can do in R. - dww
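As a rough illustration of working on the three vectors directly, here is a minimal sketch in plain R rather than Rcpp. The `trip` list is a toy stand-in with the same component names a slam simple triplet matrix uses (`i`, `j`, `v`, `nrow`, `ncol`), and `get_col` is a hypothetical helper, not a slam function. It assumes no duplicate (i, j) pairs:

```r
# Toy triplet representation: parallel vectors of row index, column index, value.
trip <- list(i = c(1L, 3L, 2L, 3L),
             j = c(1L, 1L, 2L, 3L),
             v = c(0.5, -1.2, 2.0, 0.7),
             nrow = 3L, ncol = 3L)

# Hypothetical helper: return column `col` as a dense vector of length nrow.
get_col <- function(trip, col) {
  sel <- trip$j == col              # entries that belong to this column
  out <- numeric(trip$nrow)         # zeros everywhere else
  out[trip$i[sel]] <- trip$v[sel]   # assumes no duplicate (i, j) pairs
  out
}

col1 <- get_col(trip, 1L)
print(col1)
```

Each call still scans the whole `j` vector, so for many lookups it may pay to group the positions once (for example with `split(seq_along(trip$j), trip$j)`) and reuse that index.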

2 Answers

4
votes

The STM object is actually just a list, so you can extract its components directly and build the data.table from them:

library(data.table)
STM_DT <- data.table(i = STM$i, j = STM$j, v = STM$v)

This gives:

> STM_DT
               i      j           v
       1: 186598    756  0.34271080
       2: 278329  72334  2.03924976
       3: 178388  32708  1.03925605
       4: 260635 101424  0.05780086
       5: 169321 126202  1.00027529
      ---                          
 9999996:  96209  90019 -1.09341023
 9999997:  54467  16612 -2.08070273
 9999998: 179029  96906 -0.86197333
 9999999: 153017 148731  0.47765003
10000000: 104145 123291  0.24258613

The conversion is almost instantaneous.
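Once the triplets are in a data.table, one possible way to do the column subsetting from the question is a keyed lookup. This is only a sketch: `toy_dt` is a small stand-in for `STM_DT` with the same `i`, `j`, `v` columns, and keying on `j` is an assumption about which dimension you subset by.

```r
library(data.table)

# Toy triplet table standing in for STM_DT (columns i, j, v as above).
toy_dt <- data.table(i = c(1L, 3L, 2L, 3L, 1L),
                     j = c(1L, 1L, 2L, 3L, 3L),
                     v = c(0.5, -1.2, 2.0, 0.7, 1.1))

# Sort once by column index; later lookups on j use binary search.
setkey(toy_dt, j)

col3   <- toy_dt[.(3L)]           # all non-zero entries of column 3
cols13 <- toy_dt[.(c(1L, 3L))]    # several columns in one lookup

print(col3)
print(cols13)
```

The result stays in long (triplet) form, so nothing of size nrow x ncol is ever allocated. If you subset by row instead, key on `i`.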

0
votes

Most likely you need something like this.

This example was originally developed to address a more specific problem: how to append a sparse (but huge) model matrix to a data table.

# New Example
set.seed(0)
df = data.frame(matrix(letters[sample(4,15,replace=TRUE)],5))
mat = Matrix::sparse.model.matrix(~.-1,df)
mat

Example Sparse Matrix:

5 x 8 sparse Matrix of class "dgCMatrix"
  X1a X1b X1c X1d X2b X2c X3b X3c
1   .   1   .   .   1   .   1   .
2   1   .   .   .   .   .   .   1
3   .   .   .   1   .   1   .   1
4   .   .   1   .   .   1   .   .
5   1   .   .   .   1   .   .   .

print(object.size(mat))

3760 bytes

with attributes:

mat@i # 0-based row index

[1] 1 4 0 3 2 0 4 2 3 0 1 2

mat@p # 0-based column start

[1] 0 2 3 4 5 7 9 10 12
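To show how these two slots encode positions, here is a small sketch that expands them into 1-based (row, column) pairs using only the values printed above (`i0`, `p`, and `triplets` are illustrative names). The difference between consecutive entries of `p` is the number of non-zero entries in each column:

```r
# Values copied from mat@i and mat@p above.
i0 <- c(1L, 4L, 0L, 3L, 2L, 0L, 4L, 2L, 3L, 0L, 1L, 2L)
p  <- c(0L, 2L, 3L, 4L, 5L, 7L, 9L, 10L, 12L)

row_idx <- i0 + 1L                                         # 0-based -> 1-based rows
col_idx <- rep(seq_len(length(p) - 1L), times = diff(p))   # repeat each column by its count

triplets <- data.frame(i = row_idx, j = col_idx)
print(triplets)
```

The first two pairs are (2, 1) and (5, 1), which matches the two 1s in column X1a of the printed matrix.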

Here is how to convert to data table:

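# Conversion to Data Table
library(data.table)
dt = data.table(matrix(FALSE, nrow(mat), ncol(mat)))
setnames(dt, colnames(mat))

for (cStart in seq_len(ncol(mat))) {
  n = mat@p[cStart + 1L] - mat@p[cStart]          # non-zero entries in this column
  if (n == 0L) next                               # skip empty columns
  rows = mat@i[mat@p[cStart] + seq_len(n)] + 1L   # 0-based -> 1-based row indices
  set(dt, i = rows, j = colnames(mat)[cStart], value = TRUE)
}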

print(object.size(dt))

2696 bytes

dt[,lapply(.SD, as.integer)]

This returns the desired result:

   X1a X1b X1c X1d X2b X2c X3b X3c
1:   0   1   0   0   1   0   1   0
2:   1   0   0   0   0   0   0   1
3:   0   0   0   1   0   1   0   1
4:   0   0   1   0   0   1   0   0
5:   1   0   0   0   1   0   0   0