I am working with a fairly large sparse matrix (1,000,000 × 1,000,000) for my master's thesis. I use R, and more precisely the 'Matrix' package, which is designed for this kind of matrix.

Basically, my question is: are there existing techniques to fill a sparse matrix without looping over the whole pre-existing data frame that I build the sparse matrix from? Is there a more efficient way to extract the indices where my sparse matrix will be non-zero?

I'm not the first to ask this question, but most threads concern Matlab or Python, and I didn't find a clear answer.

Let's take an example to shed some light on the issue.

Suppose I have a data frame Df of this form:

list1 <- list("a","b","c")
list2 <- list("b,c","c", NA)
Df <- data.frame(cbind(list1, list2))
Df
  list1 list2
1     a   b,c
2     b     c
3     c    NA

I want to build a sparse matrix S that assigns S[i,j]=1 if Df[j,1] appears in Df[i,2]. Thus the matrix S would be of this form:

S=matrix(c(0,0,0,1,0,0,1,1,0), nrow=3)
S
     [,1] [,2] [,3]
[1,]    0    1    1
[2,]    0    0    1
[3,]    0    0    0

Is it possible to fill this matrix without writing a loop that scans through each row and column of Df?

At the moment, my code looks something like this:

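row_vector <- vector()
col_vector <- vector()
val_vector <- vector()

count <- 0

# Compare every entry of column 1 against every entry of column 2
for (i in seq_len(nrow(Df))){
  for (j in seq_len(nrow(Df))){
    # Record S[j,i] = 1 when Df[i,1] occurs in Df[j,2]
    if (grepl(Df[i,1], Df[j,2], fixed=TRUE)){
      count <- count+1
      col_vector[count] <- i
      row_vector[count] <- j
      val_vector[count] <- 1
    }
  }
}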

This is fine for a small matrix, but the double loop makes one comparison per pair of rows, which is not practical for a matrix with 1,000,000 rows.

I hope I was clear enough. I would be grateful for any help or advice.

Regards, Germain

Can you use valid R syntax to share your sample data? I'm very confused by your "matrix" M: what does the dash represent? Why do there seem to be different numbers of columns? Are b and c strings? If you say I have a matrix M = matrix(c(1, 2, 3, 4), nrow = 2) there is no ambiguity. You can share the syntax to create it, or if you have a suitable (and small) object in your workspace, just share dput(M). - Gregor Thomas
Is M what you've created as Df? - Spacedman
Gregor, I've edited my question. It should be clearer now. @Spacedman Yes indeed. I've edited my message accordingly. - G. Gauthier

1 Answer

You should read the documentation before asking such a question: see ?sparseMatrix.

You need to provide three vectors to the constructor: row indices, column indices and values. Now you need to think about how to get the indices and values from your data.
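
As a rough illustration, here is one way the three vectors could be built without an explicit loop. This is only a sketch under some assumptions that are not in the question: the two columns are stored as plain character vectors rather than list columns, the entries of list2 are comma-separated with no extra spaces, and an entry counts as a match only when a whole token equals a value in list1 (the grepl() loop in the question matches substrings instead, so the two can differ if one key is a substring of another).

```r
library(Matrix)

# Sample data as plain character columns (assumption: the question builds
# list columns, which would need to be flattened first).
Df <- data.frame(list1 = c("a", "b", "c"),
                 list2 = c("b,c", "c", NA),
                 stringsAsFactors = FALSE)

# Split each comma-separated entry into tokens; an NA entry stays a single NA.
tokens <- strsplit(Df$list2, ",", fixed = TRUE)

# Row index: row number i repeated once per token found in Df[i, 2].
row_idx <- rep(seq_along(tokens), lengths(tokens))

# Column index: position of each token in Df$list1 (NA when there is no match).
col_idx <- match(unlist(tokens), Df$list1)

# Drop tokens without a matching key; this also drops the NA entries.
keep <- !is.na(col_idx)

# Build the sparse matrix from (row, column, value) triplets.
# A length-1 x is recycled across all index pairs.
S <- sparseMatrix(i = row_idx[keep],
                  j = col_idx[keep],
                  x = 1,
                  dims = c(nrow(Df), nrow(Df)))
S
```

Note that sparseMatrix() adds up the values of repeated (i, j) pairs, so if an entry of list2 can contain the same token twice you would probably want to de-duplicate the index pairs before calling it.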