I have a large, square matrix containing correlations between pairs of IDs (the row and column names refer to the same set of IDs):
mat <- matrix(rnorm(5000 * 5000), nrow = 5000, dimnames = list(
  paste0("ID", seq_len(5000)), paste0("ID", seq_len(5000))))
I want to extract the top 100,000 unique pairs from this matrix (i.e., the pair ID1/ID2 is the same as ID2/ID1), then convert them to a data frame. Currently, I am using this code:
corDat <- reshape2::melt(mat, varnames = c("id.A", "id.B"),
                         value.name = "value", na.rm = TRUE)
corDat <- corDat[as.character(corDat$id.A) < as.character(corDat$id.B), ]
corDat <- corDat[order(-corDat$value), ]
top_n <- 100000
corDat <- corDat[seq_len(top_n), ]
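As a side note on the filtering step: for any two distinct strings, exactly one of `a < b` and `b < a` is `TRUE`, so the lexicographic comparison keeps one orientation of each off-diagonal pair and also drops the diagonal. A small check:

```r
# exactly one orientation of each off-diagonal pair survives the filter
"ID1" < "ID2"   # TRUE  -> the ID1/ID2 row is kept
"ID2" < "ID1"   # FALSE -> the mirrored ID2/ID1 row is dropped
"ID7" < "ID7"   # FALSE -> diagonal entries are dropped as well
```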
The result:
head(corDat)
id.A id.B value
19316931 ID1931 ID3864 5.658092
14312231 ID2231 ID2863 5.416562
3225529 ID529 ID646 5.357433
3492653 ID2653 ID699 5.297154
17046659 ID1659 ID3410 5.105343
3323364 ID3364 ID665 4.987266
...
However, because the matrix is large, the first two operations above (melting the matrix and removing duplicate pairs) take a very long time, often in excess of 5 minutes. I need to apply this operation to several hundred square matrices of varying size (often larger than 5000 x 5000).
I am sure there must be a faster way to extract this information, because I do not actually need to melt the entire matrix, only the row and column names associated with the top 100,000 values. How can this operation be accomplished more efficiently?
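For what it's worth, one direction I have considered is indexing only the upper triangle and ordering just that vector, rather than melting the whole matrix. A minimal sketch (assuming a square matrix with matching row/column dimnames, as in the example above; `top_pairs` is just an illustrative name):

```r
top_pairs <- function(mat, k = 100000) {
  ut   <- upper.tri(mat)              # logical mask: one cell per unique off-diagonal pair
  vals <- mat[ut]                     # the n*(n-1)/2 unique pair values, column-major order
  k    <- min(k, length(vals))
  idx  <- order(vals, decreasing = TRUE)[seq_len(k)]      # positions of the k largest values
  # which() with arr.ind = TRUE lists TRUE cells in the same column-major
  # order as mat[ut], so idx maps cleanly back to (row, col) positions
  pos  <- which(ut, arr.ind = TRUE)[idx, , drop = FALSE]
  data.frame(id.A  = rownames(mat)[pos[, 1]],
             id.B  = colnames(mat)[pos[, 2]],
             value = vals[idx])
}

corDat <- top_pairs(mat, k = 100000)
```

This avoids both the full melt and the per-row character comparison: only a numeric vector of length n(n-1)/2 is ordered, and the pair names are looked up just for the k winners.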
– David Heckmann