I have a correlation matrix produced with corr <- cor(data, use = "pairwise.complete.obs"). I used this code to convert it to long format and keep only the correlations whose absolute value is greater than 0.1:
corr %>%
  as_tibble(rownames = "From") %>%
  gather(key = "To", value = "corr", -From) %>%
  filter(From != To) %>%
  mutate(corr_abs = abs(corr)) %>%
  filter(corr_abs > 0.1) %>%
  arrange(-corr_abs)
The result, however, mentions every correlation twice. How could I remove those duplicates when the values are in two different columns?
Output
# A tibble: 8 x 4
From To corr corr_abs
<chr> <chr> <dbl> <dbl>
1 health.age health.employed -0.393 0.393
2 health.employed health.age -0.393 0.393
3 health.age health.marital 0.212 0.212
4 health.marital health.age 0.212 0.212
5 health.alcohol health.gender 0.187 0.187
6 health.gender health.alcohol 0.187 0.187
7 health.age health.fruitveg 0.100 0.100
8 health.fruitveg health.age 0.100 0.100
Expected
# A tibble: 4 x 4
From To corr corr_abs
<chr> <chr> <dbl> <dbl>
1 health.age health.employed -0.393 0.393
2 health.age health.marital 0.212 0.212
3 health.alcohol health.gender 0.187 0.187
4 health.age health.fruitveg 0.100 0.100
Data
corr <- structure(c(1, 0.0632225392922264, 0.0554804788901363, 0.0974838182384356,
0.212473674076218, -0.0286618705621989, 0.0632225392922264, 1,
0.0908529910265203, -0.0554639294179715, -0.0326865391045356,
0.186574369192519, 0.0554804788901363, 0.0908529910265203, 1,
0.0377351030257117, -0.392764651422931, 0.065822234809157, 0.0974838182384356,
-0.0554639294179715, 0.0377351030257117, 1, 0.10048775378073,
-0.0684000695994252, 0.212473674076218, -0.0326865391045356,
-0.392764651422931, 0.10048775378073, 1, -0.0312405196930598,
-0.0286618705621989, 0.186574369192519, 0.065822234809157, -0.0684000695994252,
-0.0312405196930598, 1), .Dim = c(6L, 6L), .Dimnames = list(c("health.marital",
"health.gender", "health.employed", "health.fruitveg", "health.age",
"health.alcohol"), c("health.marital", "health.gender", "health.employed",
"health.fruitveg", "health.age", "health.alcohol")))
apply loop after creating the whole number of rows. Removing the rows beforehand would be more efficient. In the linked duplicate, the data is already created, but here there is an option to manipulate it at an earlier stage. Anyway, you are right that it can be a duplicate. - akrun
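One way to read the suggestion in this comment is to drop the redundant half of the symmetric matrix before reshaping it, so the mirrored rows are never created. The following is only a minimal sketch under that assumption, not necessarily what the commenter had in mind. To keep it self-contained it uses a small 3 x 3 matrix (the name corr_small is made up for illustration) built from rounded values taken from the Data section above; with the full corr object the same steps should apply.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Illustrative subset of the question's matrix (values rounded); hypothetical name.
vars <- c("health.marital", "health.employed", "health.age")
corr_small <- matrix(
  c( 1.000,  0.055,  0.212,
     0.055,  1.000, -0.393,
     0.212, -0.393,  1.000),
  nrow = 3, byrow = TRUE,
  dimnames = list(vars, vars)
)

# Blank out the lower triangle and the diagonal, so each pair survives only once
# and self-correlations are gone before any rows are created.
corr_upper <- corr_small
corr_upper[lower.tri(corr_upper, diag = TRUE)] <- NA

result <- corr_upper %>%
  as_tibble(rownames = "From") %>%
  # na.rm = TRUE drops the blanked-out cells while reshaping
  gather(key = "To", value = "corr", -From, na.rm = TRUE) %>%
  mutate(corr_abs = abs(corr)) %>%
  filter(corr_abs > 0.1) %>%
  arrange(-corr_abs)

print(result)
```

Because the diagonal is blanked out as well, the From/To inequality filter is no longer needed. Which variable ends up in From and which in To depends on the order of the matrix rows and columns, so the orientation of a pair may differ from the Expected table even though each pair appears only once.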