4
votes

In a correlation matrix I would like to get rid of the rows that are basically containing the same information as another row, except instead of "A" and "B" in var1 and var2 column contain "B" and "A" respectively

   var1 var2      value
1   cyl  mpg -0.8521620
2  disp  mpg -0.8475514
3    wt  mpg -0.8676594
4   mpg  cyl -0.8521620
5  disp  cyl  0.9020329
6    hp  cyl  0.8324475
7    vs  cyl -0.8108118
8   mpg disp -0.8475514
9   cyl disp  0.9020329
10   wt disp  0.8879799
11  cyl   hp  0.8324475
12  mpg   wt -0.8676594
13 disp   wt  0.8879799
14  cyl   vs -0.8108118

Here we could drop for instance row 4 with mpg vs cyl since we have cyl vs mpg in row 1 already

I know I could filter for unique values in column value, BUT i don't want to do this as with my enormous data set there is actually a chance of getting identical correlation score with multiple pairs of columns. So it has to be done by name matching col var1 and var2

I have this code so far to filter out data rows that are above a certain correlation value, but are not 1 (variable vs itself)

mtcars %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(value > 0.8 | value < -0.8) %>%
  filter(value != 1)

EDIT

Andre's answer

cor %>% {(function(x){x[upper.tri(x)]<-NA; x})(.)} %>%

is faster, but Rui's answer is more generic and can be applied to other situations other than cor matrix calculations.

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
   Andre 4.818793 5.113676 5.630160 5.408955 5.704825 22.33730   100  a 
   Rui   5.413692 5.761669 7.531146 6.003656 6.583750 78.02836   100   b
3

3 Answers

4
votes

code:

mtcars %>% 
    as.matrix %>%
    cor %>% {(function(x){x[upper.tri(x)]<-NA; x})(.)} %>%
    as.data.frame %>%
    rownames_to_column(var = 'var1') %>%
    gather(var2, value, -var1) %>%
    filter(value > 0.8 | value < -0.8) %>%
    filter(value != 1)

result:

#  var1 var2      value
#1  cyl  mpg -0.8521620
#2 disp  mpg -0.8475514
#3   wt  mpg -0.8676594
#4 disp  cyl  0.9020329
#5   hp  cyl  0.8324475
#6   vs  cyl -0.8108118
#7   wt disp  0.8879799

tricks used:

  • use an anonymous function READ MORE
  • wrap { around anonymous function to prevent default pipe behavior (pipe to first possible position), it seems to work without this step, but I feel safer doing this. READ MORE
  • remove all duplicate values (?upper.tri) (you can even remove the diagonal in that step, to remove the last code snippet filter(value != 1))

my suggestion:

mtcars %>% 
    as.matrix %>%
    cor %>% {(function(x){x[upper.tri(x, diag = T)]<-NA; x})(.)} %>%
    as.data.frame %>%
    rownames_to_column(var = 'var1') %>%
    gather(var2, value, -var1) %>%
    filter(value > 0.8 | value < -0.8)
2
votes

Another way is simply to filter by var1 < var2.

mtcars %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(value > 0.8 | value < -0.8) %>%
  filter(value != 1) %>%
  filter(var1 < var2)
#  var1 var2      value
#1  cyl  mpg -0.8521620
#2 disp  mpg -0.8475514
#3  cyl disp  0.9020329
#4  cyl   hp  0.8324475
#5  mpg   wt -0.8676594
#6 disp   wt  0.8879799
#7  cyl   vs -0.8108118
1
votes

Using base:

x <- cor(mtcars)
x[ upper.tri(x, diag = TRUE) | abs(x) < 0.8  ] <- NA
na.omit(data.frame(as.table(x)))
#    Var1 Var2       Freq
# 2   cyl  mpg -0.8521620
# 3  disp  mpg -0.8475514
# 6    wt  mpg -0.8676594
# 14 disp  cyl  0.9020329
# 15   hp  cyl  0.8324475
# 19   vs  cyl -0.8108118
# 28   wt disp  0.8879799

Compared to accepted tidy answer:

microbenchmark::microbenchmark(
  base = {
    x <- cor(mtcars)
    x[ upper.tri(x, diag = TRUE) | abs(x) < 0.8  ] <- NA
    na.omit(data.frame(as.table(x)))
  },
  tidy = {
    mtcars %>% 
      as.matrix %>%
      cor %>% {(function(x){x[upper.tri(x, diag = T)]<-NA; x})(.)} %>%
      as.data.frame %>%
      rownames_to_column(var = 'var1') %>%
      gather(var2, value, -var1) %>%
      filter(value > 0.8 | value < -0.8)    
  })
# Unit: microseconds
# expr      min        lq      mean   median        uq      max neval
# base  683.994  718.1025  790.9333  750.099  796.2825  2288.63   100
# tidy 3278.397 3405.3260 3660.0932 3488.334 3676.3870 10212.20   100