1
votes

I have a matrix and would like to find the columns that are very similar to each other (I am not looking for identical columns).

# Generate a matrix (draw 400 * 1000 values so that nothing is recycled across columns)
Mat <- matrix(rexp(400 * 1000, rate = .1), ncol = 1000, nrow = 400)

I thought of using "cor" or "all.equal" and tried the following, but it did not work:

indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))

What I need as output is which columns are highly similar and the degree of their similarity (it can be the correlation coefficient).

Similar means their values agree within some threshold, for example more than 75% of the residuals (e.g. column1 - column2) are less than 0.5 in absolute value.

I would also love to see how this differs from being correlated. Do the two approaches give identical results?

3
Do you mean similar as in correlated, or similar as in the difference in their values being within some threshold? It would be helpful if you could elaborate a bit more on what you mean by similar. - Alex A.
Elementwise? Would you consider 1,2,3,4 and 1.1,2.1,3.1,4.1 similar? How about 1,2,3,4 and 4,1,2,3? - statespace
Thanks to @Alex and others. I think finding the highly correlated columns and finding the columns that do not differ by more than some threshold would not give very different results. My main idea is to find the columns that are highly similar within a threshold, but I would love to see whether the results differ when we check for correlation instead. Anyway, I updated the question. - user1267127
I might be thinking about it wrong, but I see a simple linear regression (treat the columns as time series), and the resulting summary output is exactly what you need. If order doesn't matter, sort them ascending beforehand. - statespace
I suggest you calculate the distance matrix. Start with dist(t(Mat)). - Roland

3 Answers

1
votes

Using correlation, you could try the following (with a simpler matrix for demonstration):

set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)

corr <- cor(Mat)
res <- which(lower.triangle(corr) > .3, arr.ind = TRUE)

data.frame(res[res[,1] != res[,2],], correlation = corr[res[res[,1] != res[,2],]])
  row col correlation
1   8   1   0.3387738
2   6   2   0.3350891

Both row and col actually refer to columns in your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738.
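The same idea can be sketched in base R without matrixcalc. This is only an illustration, not the code that produced the output above: the names keep, idx, and pairs are made up for the example, and the 0.3 cutoff is carried over from the demonstration. Using upper.tri keeps each pair of columns once and drops the diagonal, so no extra filtering of self-correlations is needed.

```r
set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)

corr <- cor(Mat)

# TRUE only above the diagonal and only where the correlation exceeds the cutoff
keep <- upper.tri(corr) & corr > 0.3

# Matrix of (row, col) positions; both index columns of Mat
idx <- which(keep, arr.ind = TRUE)

pairs <- data.frame(col1 = idx[, "row"],
                    col2 = idx[, "col"],
                    correlation = corr[idx])
pairs
```

To also catch strongly negative correlations, one could compare abs(corr) against the cutoff instead.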

0
votes

I'd take a linear regression approach:

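# Draw 400 * 100 values so that columns are not recycled copies of each other
Mat <- matrix(rexp(400 * 100, rate = .1), ncol = 100, nrow = 400)
combinations <- combn(1:ncol(Mat), m = 2)
sigma <- NULL
for (i in 1:ncol(combinations)) {
  # Regress the first column of pair i on the second and keep the residual standard error
  sigma <- c(sigma, summary(lm(Mat[, combinations[1, i]] ~ Mat[, combinations[2, i]]))$sigma)
}
sigma <- data.frame(sigma = sigma, comb_nr = 1:ncol(combinations))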

The residual standard error can then serve as an optional criterion. You can further order the data frame by sigma to get the best and worst combinations.
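As a rough sketch of that last step, the pair of column indices can be stored next to each sigma so the ordered result is readable directly. The smaller 10-column matrix, the seed, and the names result, col1, and col2 are assumptions made for this illustration only.

```r
set.seed(42)
Mat <- matrix(rexp(400 * 10, rate = 0.1), nrow = 400, ncol = 10)

# Each column of `combinations` is one pair of column indices of Mat
combinations <- combn(ncol(Mat), 2)

# Residual standard error of regressing the first column of the pair on the second
sigma <- apply(combinations, 2, function(p) {
  summary(lm(Mat[, p[1]] ~ Mat[, p[2]]))$sigma
})

result <- data.frame(col1 = combinations[1, ],
                     col2 = combinations[2, ],
                     sigma = sigma)

# Smallest sigma first, i.e. the best-fitting pairs at the top
result <- result[order(result$sigma), ]
head(result)
```

Note that sigma depends on which column is treated as the response and on the scale of that column, so it is best compared across columns measured on similar scales.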

0
votes

If you want a (not so elegant) straightforward approach that's likely to be very slow for matrices of your size, you can do this:

set.seed(1)

Mat <- matrix(runif(40000), ncol=100, nrow=400)

col.combs <- t(combn(1:ncol(Mat), 2))

similar <- data.frame(Col1=NULL, Col2=NULL, Corr=NULL, Pct.Diff=NULL)

# Compare each pair of columns
for (k in 1:nrow(col.combs)) {
    i <- col.combs[k, 1]
    j <- col.combs[k, 2]

    # Difference within threshold?
    diff.thresh <- (abs(Mat[, i] - Mat[, j]) < 0.5)

    pair.corr <- cor(Mat[, i], Mat[, j])

    if (mean(diff.thresh) > 0.75)
        similar <- rbind(similar, data.frame(Col1 = i, Col2 = j, Corr = pair.corr,
                                             Pct.Diff = 100 * mean(diff.thresh)))
}

In this example there are 2590 distinct pairs of columns with more than 75% of their values within 0.5 of each other (elementwise). You can check the actual difference and correlation coefficient by looking at the resulting data frame.

> head(similar)
   Col1  Col2         Corr Pct.Diff
1     1     2 -0.003187894    76.75
2     1     3  0.074061019    76.75
3     1     4  0.082668387    78.00
4     1     5  0.001713751    75.50
5     1     8  0.052228907    75.75
6     1    12 -0.017921978    78.00

Perhaps it's not the best solution, but it gets the job done.

Also, if you're unsure why I used mean(diff.thresh), it's because the sum of a logical vector is the number of TRUE elements. The mean is the sum divided by the length, which means that in this case it's the fraction of values within the threshold.
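On the question of whether this gives the same result as correlation: in general it does not. The following is a small made-up illustration (the vectors x, y, a, b and the shift of 10 are assumptions chosen for the example, not taken from your data). Correlation ignores shifts and scale, while the threshold criterion looks only at absolute differences.

```r
set.seed(1)
n <- 400

# Case 1: perfectly correlated, but never within the threshold
x <- runif(n)
y <- x + 10
cor_shift  <- cor(x, y)                 # 1 (up to floating-point rounding)
frac_shift <- mean(abs(x - y) < 0.5)    # 0

# Case 2: always within the threshold, but drawn independently
a <- runif(n, min = 0, max = 0.4)
b <- runif(n, min = 0, max = 0.4)
cor_noise  <- cor(a, b)                 # close to 0, exact value depends on the seed
frac_noise <- mean(abs(a - b) < 0.5)    # 1, since |a - b| can never exceed 0.4
```

So a pair can pass one test and fail the other, which is also why the Corr values in the output above are near zero even though Pct.Diff exceeds 75.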