1
votes

I want to calculate the pair-wise correlations between "mpg" and all other numeric variables of interest for each cyl in the mtcars dataset. I would like to adopt the tidy data principle.

It's rather easy with corrr::correlate().

library(dplyr)
library(tidyr)
library(purrr)
library(corrr)
data(mtcars)

mtcars2 <- mtcars[,1:7] %>%
  group_nest(cyl) %>%
  mutate(cors = map(data, corrr::correlate),
         stretch = map(cors, corrr::stretch)) %>%
  unnest(stretch)

mtcars2 %>%
  filter(x == "mpg")

By using corrr::correlate(), all available pair-wise correlations have been calculated. I could use dplyr::filter() to select the correlations of interest.

However, when datasets are large, most of the computation goes to correlations I don't need, which makes this approach very time-consuming. So I tried to calculate only mpg vs. the other variables. I'm not very familiar with purrr, and the following code doesn't work.

mtcars2 <- mtcars[,1:7] %>%
  group_nest(cyl) %>%
  mutate(comp = map(data, ~colnames),
         corr = map(comp, ~cor.test(data[["mpg"]], data[[.]])))

2 Answers

0
votes

Would this work for you? I have done this in the past, but only on smallish datasets, and I have not benchmarked it, so I'm not sure about performance. I use pivot_longer() to reshape the data before nesting, so the columns you pass to pivot_longer() effectively act as the filtering step.

mtcars2 <- mtcars[,1:7] %>%
  # every column except mpg and cyl is stacked into y.var / value
  pivot_longer(c(-mpg, -cyl), names_to = "y.var", values_to = "value") %>%
  group_nest(cyl, y.var) %>%
  mutate(x.var = "mpg", # just so you can see this in the output
         # keep only the correlation estimate from each test
         cor = map_dbl(data, ~ cor.test(.x$mpg, .x$value)$estimate)) %>%
  select(data, cyl, x.var, y.var, cor) %>%
  arrange(cyl, y.var)
0
votes

If you need to use cor.test, below is an option using broom:

library(broom)
library(tidyr)
library(dplyr)

mtcars[,1:7] %>%
  pivot_longer(-c(mpg, cyl)) %>%
  group_by(cyl, name) %>%
  do(tidy(cor.test(.$mpg, .$value)))

# A tibble: 15 x 10
# Groups:   cyl, name [15]
     cyl name  estimate statistic p.value parameter conf.low conf.high method
   <dbl> <chr>    <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl> <chr> 
 1     4 disp   -0.805     -4.07  0.00278         9   -0.947   -0.397  Pears…
 2     4 drat    0.424      1.41  0.193           9   -0.236    0.816  Pears…
 3     4 hp     -0.524     -1.84  0.0984          9   -0.855    0.111  Pears…
 4     4 qsec   -0.236     -0.728 0.485           9   -0.732    0.424  Pears…
 5     4 wt     -0.713     -3.05  0.0137          9   -0.920   -0.198  Pears…
 6     6 disp    0.103      0.232 0.826           5   -0.705    0.794  Pears…
 7     6 drat    0.115      0.258 0.807           5   -0.699    0.799  Pears…
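
As far as I know, do() is marked as superseded in recent dplyr releases, so here is a minimal sketch of the same per-group cor.test() using group_modify() instead. It assumes dplyr 1.0 or later; the name res is only for illustration, and the result should have the same columns as the output above.

```r
library(dplyr)
library(tidyr)
library(broom)

res <- mtcars[, 1:7] %>%
  pivot_longer(-c(mpg, cyl)) %>%
  group_by(cyl, name) %>%
  # .x holds the rows of one (cyl, name) group; tidy() returns a one-row tibble
  group_modify(~ tidy(cor.test(.x$mpg, .x$value))) %>%
  ungroup()

res
```

group_modify() puts the grouping columns (cyl, name) back in front of whatever the function returns, so no extra bookkeeping is needed.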

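If you only need the correlation coefficients, the nesting etc. might be costly and unnecessary for big datasets, because you can simply call cor() on mpg against all the other columns within each cyl group and collect the results in a data frame: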

library(purrr) # for map_dfr()

# define columns to correlate (mpg itself is included, hence the rho = 1 rows)
cor_vars = setdiff(colnames(mtcars)[1:7], "cyl")
split(mtcars[,1:7], mtcars$cyl) %>%
  map_dfr(~data.frame(x = "mpg", y = cor_vars,
                      cyl = unique(.x$cyl),
                      rho = as.numeric(cor(.x$mpg, .x[,cor_vars]))))

     x    y cyl         rho
1  mpg  mpg   4  1.00000000
2  mpg disp   4 -0.80523608
3  mpg   hp   4 -0.52350342
4  mpg drat   4  0.42423947
5  mpg   wt   4 -0.71318483
6  mpg qsec   4 -0.23595389
7  mpg  mpg   6  1.00000000
8  mpg disp   6  0.10308269
9  mpg   hp   6 -0.12706785
10 mpg drat   6  0.11471598
11 mpg   wt   6 -0.68154982
12 mpg qsec   6 -0.41871779
13 mpg  mpg   8  1.00000000
14 mpg disp   8 -0.51976704
15 mpg   hp   8 -0.28363567
16 mpg drat   8  0.04793248
17 mpg   wt   8 -0.65035801
18 mpg qsec   8 -0.10433602
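
The output above includes a trivial mpg vs. mpg row (rho = 1) for each cyl group. If those rows are not wanted, one possible variation is to leave mpg out of the column list. This is only a sketch: the names vars, other_vars and res are hypothetical, and it assumes all remaining columns are numeric.

```r
library(dplyr)
library(purrr)

vars <- mtcars[, 1:7]
# every column except the grouping column and mpg itself
other_vars <- setdiff(colnames(vars), c("cyl", "mpg"))

res <- split(vars, vars$cyl) %>%
  map_dfr(~ data.frame(
    cyl = unique(.x$cyl),
    x   = "mpg",
    y   = other_vars,
    # cor() of one vector against several columns gives a 1 x k matrix
    rho = as.numeric(cor(.x$mpg, .x[, other_vars]))
  ))

res
```

This gives 5 rows per cyl group instead of 6, with one cor() call per group.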