How to find the most similar columns in a matrix?

Question

I have a matrix in which I would like to find the columns that are very similar to each other (I am not looking for identical columns).

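# generate a 400 x 1000 matrix; draw one value per cell so that nothing is recycled
# (rexp(200, ...) would be recycled and make every column identical)
Mat <- matrix(rexp(400 * 1000, rate = .1), ncol = 1000, nrow = 400)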

I thought of using "cor" or "all.equal" and tried the following, but it did not work:

indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))

What I need as output is which columns are highly similar and their degree of similarity (this can be the correlation coefficient).

By similar I mean that their values agree within some threshold, for example more than 75% of the residuals (e.g. column1 - column2) are less than 0.5 in absolute value.

I would also love to see how this differs from being correlated. Do the two approaches give identical results?

Do you mean similar as in correlated or similar as in the difference in their values is within some threshold? It would be helpful if you could elaborate a bit more on what you mean by similar. — Alex A.
Elementwise? Would you consider 1,2,3,4 and 1.1,2.1,3.1,4.1 similar? How about 1,2,3,4 and 4,1,2,3? — statespace
Thanks to @Alex and others. I think it would not make much difference whether I look for columns that are highly correlated or columns that do not differ by more than some threshold. My main goal is to find those that are highly similar within a threshold, but I would love to see whether the results differ when we check for correlation instead. Anyway, I updated the question. — user1267127
I might be thinking about this the wrong way, but I see a simple linear regression (treating the columns as time series), and the resulting summary output is exactly what you need. If order doesn't matter, sort the columns in ascending order beforehand. — statespace
I suggest you calculate the distance matrix. Start with dist(t(Mat)). — Roland

DatamineR · Accepted Answer · 2015-03-10T14:52:34

Using correlation, you could try the following (with a smaller matrix for demonstration):

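set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)  # provides lower.triangle()

corr <- cor(Mat)  # 10 x 10 matrix of pairwise column correlations
# keep only the lower triangle (so each pair appears once) and find entries above 0.3
res <- which(lower.triangle(corr) > .3, arr.ind = TRUE)

# drop the diagonal, where each column is compared with itself (correlation 1)
data.frame(res[res[,1] != res[,2],], correlation = corr[res[res[,1] != res[,2],]])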
  row col correlation
1   8   1   0.3387738
2   6   2   0.3350891

Both row and col refer to columns of your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738.
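
The question also asks about similarity within a threshold and whether that gives the same result as correlation. A minimal base R sketch of one way to check this follows. The helper name frac_within, the 0.5 tolerance and the 0.75 cutoff are assumptions taken from the example in the question, and the two extra columns are added purely for illustration.

```r
# Hypothetical helper (not from any package): for every pair of columns,
# the fraction of rows whose absolute difference is below `tol`.
# Assumes M has at least two columns.
frac_within <- function(M, tol = 0.5) {
  p <- ncol(M)
  out <- matrix(1, p, p)  # a column always agrees with itself
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      out[i, j] <- out[j, i] <- mean(abs(M[, i] - M[, j]) < tol)
    }
  }
  out
}

set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
# Illustration only: column 11 is a near-copy of column 1,
# column 12 is column 1 shifted by a constant.
Mat <- cbind(Mat, Mat[, 1] + 0.1, Mat[, 1] + 10)

sim <- frac_within(Mat, tol = 0.5)

# Pairs where more than 75% of the residuals are within the tolerance;
# lower.tri() keeps each pair once and excludes the diagonal.
pairs <- which(sim > 0.75 & lower.tri(sim), arr.ind = TRUE)
data.frame(pairs, frac_within = sim[pairs], correlation = cor(Mat)[pairs])
```

The two measures do not have to agree. Column 12 is perfectly correlated with column 1 (a constant shift leaves the correlation at 1), yet none of its values lie within 0.5 of column 1, so the threshold criterion does not report that pair. Correlation measures whether columns move together, while the threshold criterion measures whether their values are actually close. The double loop is fine for a few hundred columns; for a matrix with 1000 columns it visits about half a million pairs, so a vectorised approach may be preferable.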
