Finding duplicates over two columns

Question

I have a correlation matrix produced with corr <- cor(data, use = "pairwise.complete.obs"). I used this code to reshape it to long format and keep only the pairs whose absolute correlation is above 0.1:

library(dplyr)
library(tidyr)

corr %>% 
  as_tibble(rownames = "From") %>% 
  gather(key = "To", value = "corr", -From) %>% 
  filter(From != To) %>%               # drop the diagonal (self-correlations)
  mutate(corr_abs = abs(corr)) %>%
  filter(corr_abs > 0.1) %>% 
  arrange(-corr_abs)

The result, however, lists every correlation twice, once in each direction. How can I remove these duplicates when the two variable names are swapped between the From and To columns?

Output

# A tibble: 8 x 4
  From            To                corr corr_abs
  <chr>           <chr>            <dbl>    <dbl>
1 health.age      health.employed -0.393    0.393
2 health.employed health.age      -0.393    0.393
3 health.age      health.marital   0.212    0.212
4 health.marital  health.age       0.212    0.212
5 health.alcohol  health.gender    0.187    0.187
6 health.gender   health.alcohol   0.187    0.187
7 health.age      health.fruitveg  0.100    0.100
8 health.fruitveg health.age       0.100    0.100

Expected

# A tibble: 4 x 4
  From            To                corr corr_abs
  <chr>           <chr>            <dbl>    <dbl>
1 health.age      health.employed -0.393    0.393
2 health.age      health.marital   0.212    0.212
3 health.alcohol  health.gender    0.187    0.187
4 health.age      health.fruitveg  0.100    0.100

Data

corr <- structure(c(1, 0.0632225392922264, 0.0554804788901363, 0.0974838182384356, 
0.212473674076218, -0.0286618705621989, 0.0632225392922264, 1, 
0.0908529910265203, -0.0554639294179715, -0.0326865391045356, 
0.186574369192519, 0.0554804788901363, 0.0908529910265203, 1, 
0.0377351030257117, -0.392764651422931, 0.065822234809157, 0.0974838182384356, 
-0.0554639294179715, 0.0377351030257117, 1, 0.10048775378073, 
-0.0684000695994252, 0.212473674076218, -0.0326865391045356, 
-0.392764651422931, 0.10048775378073, 1, -0.0312405196930598, 
-0.0286618705621989, 0.186574369192519, 0.065822234809157, -0.0684000695994252, 
-0.0312405196930598, 1), .Dim = c(6L, 6L), .Dimnames = list(c("health.marital", 
"health.gender", "health.employed", "health.fruitveg", "health.age", 
"health.alcohol"), c("health.marital", "health.gender", "health.employed", 
"health.fruitveg", "health.age", "health.alcohol")))

@akrun Hmm. To get from output to expected would require exactly what you demonstrated as an answer to the first dupe link, no? — Maurits Evers
@MauritsEvers. Maybe, but I was trying to avoid that apply loop after creating the full set of rows. Removing the rows beforehand would be more efficient. In the dupe link, the data is already created, but here there is an option to manipulate it at an earlier stage. Anyway, you are right that it can be a duplicate — akrun
I think the difference is that in my case the answer below is a real improvement. It corrects the data before the duplicates are introduced, avoids the complicated apply + transposing + ordering and can be used with the dplyr logic. — JBGruber
@akrun Yeah, your answer definitely gives a much more comprehensive solution. Very nice (+1). I reopened the question. — Maurits Evers

akrun · Accepted Answer · 2019-01-16T11:38:05

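One option is to set the diagonal and upper-triangular values of the correlation matrix to NA before reshaping, then drop those rows with na.rm = TRUE in gather(). Because the matrix is symmetric, this keeps each pair exactly once: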

corr %>% 
   replace(., upper.tri(., diag = TRUE), NA) %>%
   as_tibble(rownames = "From") %>% 
   gather(key = "To", value = "corr", -From, na.rm = TRUE) %>% 
   mutate(corr_abs = abs(corr)) %>% 
   filter(corr_abs > 0.1) %>% 
   arrange(-corr_abs)
# A tibble: 4 x 4
#  From           To                corr corr_abs
#  <chr>          <chr>            <dbl>    <dbl>
#1 health.age     health.employed -0.393    0.393
#2 health.age     health.marital   0.212    0.212
#3 health.alcohol health.gender    0.187    0.187
#4 health.age     health.fruitveg  0.100    0.100
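
The same idea can also be written with pivot_longer(), which tidyr now recommends over the superseded gather(). The following is only a sketch under stated assumptions: it uses correlations of four columns from the built-in mtcars data set as a stand-in for the question's corr matrix, and the names m and pairs_long are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Illustrative input only (assumption): correlations of four mtcars columns,
# standing in for the `corr` matrix from the question
m <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])

# Blank out the diagonal and the upper triangle so each pair survives once
m[upper.tri(m, diag = TRUE)] <- NA

pairs_long <- m %>%
  as_tibble(rownames = "From") %>%
  # values_drop_na = TRUE plays the role of na.rm = TRUE in gather()
  pivot_longer(-From, names_to = "To", values_to = "corr",
               values_drop_na = TRUE) %>%
  mutate(corr_abs = abs(corr)) %>%
  filter(corr_abs > 0.1) %>%
  arrange(desc(corr_abs))

pairs_long
```

Masking before reshaping means the duplicated rows are never created, so no row-wise sorting or comparison of From and To is needed afterwards.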
