How to remove rows that contain identical pairs in opposite order in 2 columns

Question

In a correlation matrix I would like to get rid of the rows that are basically containing the same information as another row, except instead of "A" and "B" in var1 and var2 column contain "B" and "A" respectively

   var1 var2      value
1   cyl  mpg -0.8521620
2  disp  mpg -0.8475514
3    wt  mpg -0.8676594
4   mpg  cyl -0.8521620
5  disp  cyl  0.9020329
6    hp  cyl  0.8324475
7    vs  cyl -0.8108118
8   mpg disp -0.8475514
9   cyl disp  0.9020329
10   wt disp  0.8879799
11  cyl   hp  0.8324475
12  mpg   wt -0.8676594
13 disp   wt  0.8879799
14  cyl   vs -0.8108118

Here we could drop for instance row 4 with mpg vs cyl since we have cyl vs mpg in row 1 already

I know I could filter for unique values in column value, BUT i don't want to do this as with my enormous data set there is actually a chance of getting identical correlation score with multiple pairs of columns. So it has to be done by name matching col var1 and var2

I have this code so far to filter out data rows that are above a certain correlation value, but are not 1 (variable vs itself)

mtcars %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(value > 0.8 | value < -0.8) %>%
  filter(value != 1)

EDIT

Andre's answer

cor %>% {(function(x){x[upper.tri(x)]<-NA; x})(.)} %>%

is faster, but Rui's answer is more generic and can be applied to other situations other than cor matrix calculations.

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
   Andre 4.818793 5.113676 5.630160 5.408955 5.704825 22.33730   100  a 
   Rui   5.413692 5.761669 7.531146 6.003656 6.583750 78.02836   100   b

Andre Elrico Andre Elrico · Accepted Answer · 2018-10-23T13:12:35

code:

mtcars %>% 
    as.matrix %>%
    cor %>% {(function(x){x[upper.tri(x)]<-NA; x})(.)} %>%
    as.data.frame %>%
    rownames_to_column(var = 'var1') %>%
    gather(var2, value, -var1) %>%
    filter(value > 0.8 | value < -0.8) %>%
    filter(value != 1)

result:

#  var1 var2      value
#1  cyl  mpg -0.8521620
#2 disp  mpg -0.8475514
#3   wt  mpg -0.8676594
#4 disp  cyl  0.9020329
#5   hp  cyl  0.8324475
#6   vs  cyl -0.8108118
#7   wt disp  0.8879799

tricks used:

use an anonymous function READ MORE
wrap { around anonymous function to prevent default pipe behavior (pipe to first possible position), it seems to work without this step, but I feel safer doing this. READ MORE
remove all duplicate values (?upper.tri) (you can even remove the diagonal in that step, to remove the last code snippet filter(value != 1))

my suggestion:

mtcars %>% 
    as.matrix %>%
    cor %>% {(function(x){x[upper.tri(x, diag = T)]<-NA; x})(.)} %>%
    as.data.frame %>%
    rownames_to_column(var = 'var1') %>%
    gather(var2, value, -var1) %>%
    filter(value > 0.8 | value < -0.8)

How to remove rows that contain identical pairs in opposite order in 2 columns

3 Answers