0
votes

I have a large, square matrix containing correlations between each row or column:

mat <- matrix(rnorm(5000 * 5000), nrow = 5000, dimnames = list(
    paste0("ID", seq_len(5000)), paste0("ID", seq_len(5000))))

I want to extract the top 100,000 unique pairs from this matrix (i.e., ID1—ID2 is the same as ID2—ID1), then convert them to a data frame. Currently, I am using this code:

corDat <- reshape2::melt(mat, varnames = c("id.A", "id.B"),
                       value.name = "value", na.rm = T)
corDat <- corDat[as.character(corDat$id.A) < as.character(corDat$id.B),]
corDat <- corDat[order(-corDat$value),]
top_n <- 100000
corDat <- corDat[seq_len(top_n),]

The result:

head(corDat)

           id.A   id.B    value
19316931 ID1931 ID3864 5.658092
14312231 ID2231 ID2863 5.416562
3225529   ID529  ID646 5.357433
3492653  ID2653  ID699 5.297154
17046659 ID1659 ID3410 5.105343
3323364  ID3364  ID665 4.987266    
...

However, because the matrix is large, the first two operations above (melting the matrix and removing duplicate pairs) take a very long time, often in excess of 5 minutes. I need to apply this operation to several hundred square matrices of variable size (often larger than 5000x5000).

I am sure there must be a faster way to extract this information, because I do not actually need to melt the entire matrix—just the row and column names associated with the top 100,000 values. How can this operation be accomplished more efficiently?

1
When you say "top 100,000 unique pairs", do you mean the upper triangle of the matrix mat?David Heckmann

1 Answers

1
votes

Here is a base R version using indexing:

up <- upper.tri(mat)
inds <- which( up , arr.ind = TRUE )
out <- data.frame( id.A = rownames(mat)[ inds[,1] ] , id.B = rownames(mat)[ inds[,2] ] , value = mat[up] )
#order and select the top ten:
out[order(-out$value),][1:10,]

On my machine this is about 3.5x faster than melt.

I am assuming that you want the upper triangular part of your matrix.

Note that your as.character(corDat$id.A) < as.character(corDat$id.B) statement does NOT select the upper triangular part, because of the way multi-digit string comparisons work (e.g. "ID10" < "ID7" evaluates to TRUE ).