In a correlation matrix I would like to get rid of the rows that are basically containing the same information as another row, except instead of "A" and "B" in var1 and var2 column contain "B" and "A" respectively
var1 var2 value
1 cyl mpg -0.8521620
2 disp mpg -0.8475514
3 wt mpg -0.8676594
4 mpg cyl -0.8521620
5 disp cyl 0.9020329
6 hp cyl 0.8324475
7 vs cyl -0.8108118
8 mpg disp -0.8475514
9 cyl disp 0.9020329
10 wt disp 0.8879799
11 cyl hp 0.8324475
12 mpg wt -0.8676594
13 disp wt 0.8879799
14 cyl vs -0.8108118
Here we could drop for instance row 4 with mpg vs cyl since we have cyl vs mpg in row 1 already
I know I could filter for unique values in column value, BUT i don't want to do this as with my enormous data set there is actually a chance of getting identical correlation score with multiple pairs of columns. So it has to be done by name matching col var1
and var2
I have this code so far to filter out data rows that are above a certain correlation value, but are not 1 (variable vs itself)
mtcars %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
filter(value > 0.8 | value < -0.8) %>%
filter(value != 1)
EDIT
Andre's answer
cor %>% {(function(x){x[upper.tri(x)]<-NA; x})(.)} %>%
is faster, but Rui's answer is more generic and can be applied to other situations other than cor matrix calculations.
Unit: milliseconds
expr min lq mean median uq max neval cld
Andre 4.818793 5.113676 5.630160 5.408955 5.704825 22.33730 100 a
Rui 5.413692 5.761669 7.531146 6.003656 6.583750 78.02836 100 b