1
votes

I have a data set with 93 variables and I am trying to check for correlations between/among all variables and then screen for correlations above an absolute value of 0.5. I used How to compute correlations between all columns in R and detect highly correlated variables to help me with this problem and my code looks like this:

library(tibble)
library(dplyr)
library(tidyr)

co_mat = data %>%
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1)

co_mat2 = filter(co_mat, abs(value) > .5)

This worked well except I noticed that I have a lot of instances where a variable is correlated with itself.

I also noticed that there are instances where the same variables were tested for correlation twice but are in different columns (i.e., redundant correlations).

I would like to return a correlation table [matrix] like that in co_mat2 [from my code]. But, I want to eliminate rows where a variable is tested for correlation with itself. I would also like to eliminate rows of redundant correlations.

Thank you in advance.

3

3 Answers

1
votes

You might need something like this:

library(tidyverse)

# get names of dataset
x = names(mtcars)

data.frame(t(combn(x,2)), stringsAsFactors = F) %>%
  rowwise() %>%
  mutate(v = cor(mtcars[,X1], mtcars[,X2]))

# # A tibble: 55 x 3
#   X1    X2         v
#  <chr> <chr>  <dbl>
# 1 mpg   cyl   -0.852
# 2 mpg   disp  -0.848
# 3 mpg   hp    -0.776
# 4 mpg   drat   0.681
# 5 mpg   wt    -0.868
# 6 mpg   qsec   0.419
# 7 mpg   vs     0.664
# 8 mpg   am     0.600
# 9 mpg   gear   0.480
# 10 mpg   carb  -0.551
# # ... with 45 more rows

You get combinations of all columns of your dataset and then for each pair you get the correlation. This will exclude correlations of one variable with itself and correlations between same variables but in different order.

Remember that you can avoid the rowwise part if you vectorise your function, like this

GetCor = function(x,y) cor(mtcars[,x], mtcars[,y])
GetCor = Vectorize(GetCor)

data.frame(t(combn(x,2)), stringsAsFactors = F) %>%
  mutate(v = GetCor(X1,X2))

You can then apply your correlation threshold to column v of the output and then reshape, if you like.

0
votes

Here's one way to get the pairwise correlations of all variables in a data frame:

# generate some fake data, some of whose columns are correlated
set.seed(6933)
dat <- data.frame(
  a = c(rnorm(10, 1), rnorm(10, 2), rnorm(10, 3)), 
  b = c(rnorm(10, 2), rnorm(10, 4), rnorm(10, 6)), 
  c = c(rnorm(10, 5), rnorm(10, 3), rnorm(10, 7)), 
  d = c(rnorm(10, 3), rnorm(10, 2), rnorm(10, 1)), 
  e = c(rnorm(10, 9), rnorm(10, 5), rnorm(10, 2)))

# make a df of all the pairwise combinations of column names (10 rows)
corrs_df <- as.data.frame(t(combn(names(dat), m=2)))

# compute the correlation between each of the columns of `dat`
corrs_df$cor <- apply(corrs_df, 1, function(x) cor(dat[[x[1]]], dat[[x[2]]]))

Then you can filter corrs_df by some value, to get just the set of pairs that are below (or above) some threshold:

threshold <- .5
corrs_df[corrs_df$cor < threshold, ]  # 9 rows remain

You might also filter by whether the absolute value of the correlation is above the threshold, e.g.:

corrs_df[abs(corrs_df$cor) < threshold, ]  # 7 rows remain 
0
votes

If you'll need to do it repeatedly, consider this:

require(tidyverse)
require(Hmisc)

Build function:

flattenCorrMatrix <- function(DF) {
  DF <- DF %>% as.matrix() %>% Hmisc::rcorr()
  ut <- upper.tri(DF$r)
  flat <- data.frame(row = rownames(DF$r)[row(DF$r)[ut]], column = rownames(DF$r)[col(DF$r)[ut]], 
                     cor = (DF$r)[ut], p = DF$P[ut], n = DF$n[ut])
  return(flat)
}

Run on data (I used the mtcars dataset):

mtcars %>% 
  flattenCorrMatrix() %>% 
  filter(cor < abs(0.5))

output:

   row column         cor            p  n
1   mpg    cyl -0.85216196 6.112688e-10 32
2   mpg   disp -0.84755138 9.380328e-10 32
3   mpg     hp -0.77616837 1.787835e-07 32
4   cyl   drat -0.69993811 8.244636e-06 32
5  disp   drat -0.71021393 5.282022e-06 32
6    hp   drat -0.44875912 9.988772e-03 32
7   mpg     wt -0.86765938 1.293958e-10 32
8  drat     wt -0.71244065 4.784260e-06 32
9   mpg   qsec  0.41868403 1.708199e-02 32
10  cyl   qsec -0.59124207 3.660533e-04 32
11 disp   qsec -0.43369788 1.314404e-02 32
12   hp   qsec -0.70822339 5.766253e-06 32
13 drat   qsec  0.09120476 6.195826e-01 32
14   wt   qsec -0.17471588 3.388683e-01 32
15  cyl     vs -0.81081180 1.843018e-08 32
16 disp     vs -0.71041589 5.235012e-06 32
17   hp     vs -0.72309674 2.940896e-06 32
18 drat     vs  0.44027846 1.167553e-02 32
19   wt     vs -0.55491568 9.798492e-04 32
20  cyl     am -0.52260705 2.151207e-03 32
21 disp     am -0.59122704 3.662114e-04 32
22   hp     am -0.24320426 1.798309e-01 32
23   wt     am -0.69249526 1.125440e-05 32
24 qsec     am -0.22986086 2.056621e-01 32
25   vs     am  0.16834512 3.570439e-01 32
26  mpg   gear  0.48028476 5.400948e-03 32
27  cyl   gear -0.49268660 4.173297e-03 32
28 disp   gear -0.55556920 9.635921e-04 32
29   hp   gear -0.12570426 4.930119e-01 32
30   wt   gear -0.58328700 4.586601e-04 32
31 qsec   gear -0.21268223 2.425344e-01 32
32   vs   gear  0.20602335 2.579439e-01 32
33  mpg   carb -0.55092507 1.084446e-03 32
34 disp   carb  0.39497686 2.526789e-02 32
35 drat   carb -0.09078980 6.211834e-01 32
36   wt   carb  0.42760594 1.463861e-02 32
37 qsec   carb -0.65624923 4.536949e-05 32
38   vs   carb -0.56960714 6.670496e-04 32
39   am   carb  0.05753435 7.544526e-01 32
40 gear   carb  0.27407284 1.290291e-01 32