I am dealing with a quite large sparse matrix (1,000,000 × 1,000,000) for my master's thesis. I use R, and more precisely the 'Matrix' package, which handles sparse matrices.
Basically, my question is: are there existing techniques to fill a sparse matrix without looping over the whole pre-existing data frame that the sparse matrix is built from? Is there a more efficient way to extract the indices where my sparse matrix will be non-zero?
I'm not the first to ask this question, but most threads concern Matlab or Python, and I didn't find a clear answer.
Let's take an example to shed some light on the issue.
Suppose I have a data frame Df of this form:
list1 <- list("a","b","c")
list2 <- list("b,c","c", NA)
Df <- data.frame(cbind(list1, list2))
Df
  list1 list2
1     a   b,c
2     b     c
3     c    NA
I want to build a sparse matrix S with S[i,j] = 1 if Df[j,1] appears in Df[i,2], and 0 otherwise. The matrix S would then look like this:
S=matrix(c(0,0,0,1,0,0,1,1,0), nrow=3)
S
     [,1] [,2] [,3]
[1,]    0    1    1
[2,]    0    0    1
[3,]    0    0    0
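In sparse form, only the positions of the three 1s need to be stored. A minimal sketch, assuming the triplet constructor sparseMatrix() from the Matrix package, with the indices hard-coded from the toy example (the name S_sparse is only illustrative):

```r
library(Matrix)

# (row, column) positions of the non-zero entries of S in the toy example
row_idx <- c(1, 1, 2)
col_idx <- c(2, 3, 3)

# Build the 3 x 3 sparse matrix from the triplets; all stored values are 1
S_sparse <- sparseMatrix(i = row_idx, j = col_idx,
                         x = rep(1, length(row_idx)),
                         dims = c(3, 3))
S_sparse
```

So the real task is to obtain the row and column index vectors efficiently.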
Is it possible to fill this matrix without writing a loop that scans through each row and column of Df?
At the moment, my code looks something like this:
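# Triplet vectors: row index, column index and value of each non-zero entry
row_vector <- vector()
col_vector <- vector()
val_vector <- vector()
count <- 0
for (i in 1:3) {
  for (j in 1:3) {
    # Does Df[i,1] appear in Df[j,2]? grepl() returns FALSE when Df[j,2] is NA
    if (grepl(Df[i,1], Df[j,2], fixed=TRUE)) {
      count <- count + 1
      col_vector[count] <- i
      row_vector[count] <- j
      val_vector[count] <- 1
    }
  }
}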
This is fine when I have a small matrix, but the double loop makes n² grepl calls for n rows, which becomes impractical at the size I am dealing with.
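One direction that might avoid the double loop is to split each list2 entry once and look the pieces up with match(). This is only a sketch under assumptions that go beyond the example: the columns are rebuilt as plain character vectors (not list columns as in Df above), list2 holds comma-separated labels that must equal a list1 entry exactly (not arbitrary substrings, unlike grepl), and the list1 values are unique.

```r
library(Matrix)

# Assumption: plain character columns instead of the list columns used above
Df <- data.frame(list1 = c("a", "b", "c"),
                 list2 = c("b,c", "c", NA),
                 stringsAsFactors = FALSE)

# Split every list2 entry on commas; an NA entry stays a single NA
parts <- strsplit(Df$list2, ",", fixed = TRUE)

# Row index i, repeated once per piece found in Df[i, 2]
i <- rep(seq_along(parts), lengths(parts))
# Column index j such that Df[j, 1] equals the piece (NA when there is none)
j <- match(unlist(parts), Df$list1)

# Drop pieces with no matching list1 entry (including the NA rows)
keep <- !is.na(j)
S <- sparseMatrix(i = i[keep], j = j[keep],
                  x = rep(1, sum(keep)),
                  dims = c(nrow(Df), nrow(Df)))
S
```

The work here grows with the total number of comma-separated pieces rather than with the square of the number of rows. Note that sparseMatrix() adds up the values of repeated (i, j) pairs, so a label listed twice in the same list2 entry would give a 2 unless duplicates are removed first.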
I hope I was clear enough. I would be grateful for any help or advice.
Regards, Germain
M, what does the dash represent? Why do there seem to be different numbers of columns? Are b and c strings? If you say "I have a matrix M = matrix(c(1, 2, 3, 4), nrow = 2)" there is no ambiguity. You can share the syntax to create it, or if you have a suitable (and small) object in your workspace, just share dput(M). - Gregor Thomas
M what you've created as Df? - Spacedman