16
votes

How to create simple summary statistics using dplyr from multiple variables? Using the summarise_each function seems to be the way to go, however, when applying multiple functions to multiple columns, the result is a wide, hard-to-read data frame.

6

6 Answers

36
votes

Use dplyr in combination with tidyr to reshape the end result.

library(dplyr)
library(tidyr)

df <- tbl_df(mtcars)

df.sum <- df %>%
  select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
  summarise_each(funs(min = min, 
                      q25 = quantile(., 0.25), 
                      median = median, 
                      q75 = quantile(., 0.75), 
                      max = max,
                      mean = mean, 
                      sd = sd))

# the result is a wide data frame
> dim(df.sum)
[1]  1 42

# reshape it using tidyr functions

df.stats.tidy <- df.sum %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean, sd) # reorder columns

> print(df.stats.tidy)

   var  min    q25 median  q75  max     mean        sd
1   am  0.0  0.000    0.0  1.0  1.0  0.40625 0.4989909
2 carb  1.0  2.000    2.0  4.0  8.0  2.81250 1.6152000
3  cyl  4.0  4.000    6.0  8.0  8.0  6.18750 1.7859216
4 gear  3.0  3.000    4.0  4.0  5.0  3.68750 0.7378041
5  mpg 10.4 15.425   19.2 22.8 33.9 20.09062 6.0269481
6   vs  0.0  0.000    0.0  1.0  1.0  0.43750 0.5040161
17
votes

I liked paljenczy's idea of just using dplyr/tidy and getting the table in a data.frame/tibble before formatting it. But I ran into robustness issues: Because it relies on parsing variable names it choked on columns with underscores in the names. After trying to fix this within the dplyr framework it seemed like it would always be somewhat fragile because it relied on string parsing.

So in the end I decided on using psych::describe() which is a function designed for exactly this thing. It doesn't do completely arbitrary functions, but pretty much anything one would realistically want to do. A full example duplicating the previous solutions is included below, combining psych::describe() with some tidyverse stuff to get the exact tibble we are looking for.

It is worth noting that this answer has been updated to reflect the changed behavior of as_tibble() with regards to how it handles rownames in data.frames:

library(psych)
library(tidyverse)

# Create an extended version with a bunch of stats 
d.summary.extended <- mtcars %>%
    select(mpg, cyl, vs, am, gear, carb) %>%
    psych::describe(quant=c(.25,.75)) %>%
    as_tibble(rownames="rowname")  %>%
    print()

<OUTPUT>
# A tibble: 6 x 16
  rowname  vars     n     mean        sd median    trimmed     mad   min   max range       skew  kurtosis         se  Q0.25 Q0.75
    <chr> <int> <dbl>    <dbl>     <dbl>  <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl>      <dbl>  <dbl> <dbl>
1     mpg     1    32 20.09062 6.0269481   19.2 19.6961538 5.41149  10.4  33.9  23.5  0.6106550 -0.372766 1.06542396 15.425  22.8
2     cyl     2    32  6.18750 1.7859216    6.0  6.2307692 2.96520   4.0   8.0   4.0 -0.1746119 -1.762120 0.31570933  4.000   8.0
3      vs     3    32  0.43750 0.5040161    0.0  0.4230769 0.00000   0.0   1.0   1.0  0.2402577 -2.001938 0.08909831  0.000   1.0
4      am     4    32  0.40625 0.4989909    0.0  0.3846154 0.00000   0.0   1.0   1.0  0.3640159 -1.924741 0.08820997  0.000   1.0
5    gear     5    32  3.68750 0.7378041    4.0  3.6153846 1.48260   3.0   5.0   2.0  0.5288545 -1.069751 0.13042656  3.000   4.0
6    carb     6    32  2.81250 1.6152000    2.0  2.6538462 1.48260   1.0   8.0   7.0  1.0508738  1.257043 0.28552971  2.000   4.0
</OUTPUT>

# Select stats for comparison with other solutions
d.summary <- d.summary.extended %>%
    select(var=rowname, min, q25=Q0.25, median, q75=Q0.75, max, mean, sd) %>%
    print()

<OUTPUT>
# A tibble: 6 x 8
    var   min    q25 median   q75   max     mean        sd
  <chr> <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1   mpg  10.4 15.425   19.2  22.8  33.9 20.09062 6.0269481
2   cyl   4.0  4.000    6.0   8.0   8.0  6.18750 1.7859216
3    vs   0.0  0.000    0.0   1.0   1.0  0.43750 0.5040161
4    am   0.0  0.000    0.0   1.0   1.0  0.40625 0.4989909
5  gear   3.0  3.000    4.0   4.0   5.0  3.68750 0.7378041
6  carb   1.0  2.000    2.0   4.0   8.0  2.81250 1.6152000    
</OUTPUT>
15
votes

A potentially easy solution could created with broom::tidy and purrr::map_df. broom::tidy summarises key objects from statistical ouput into a tibble. purrr::map_df applies function to each element, in this case a column and returns a tibble.

Example

library(tidyverse)
mtcars %>% 
    select(mpg, cyl, vs, am, gear, carb) %>% 
    map_df(.f = ~ broom::tidy(summary(.x)), .id = "variable")

Results

# A tibble: 6 x 7
# variable minimum    q1 median   mean    q3 maximum
# <chr>      <dbl> <dbl>  <dbl>  <dbl> <dbl>   <dbl>
# 1 mpg         10.4  15.4   19.2 20.1    22.8    33.9
# 2 cyl          4     4      6    6.19    8       8  
# 3 vs           0     0      0    0.438   1       1  
# 4 am           0     0      0    0.406   1       1  
# 5 gear         3     3      4    3.69    4       5  
# 6 carb         1     2      2    2.81    4       8  
13
votes

If you want to create a summary table for publication (not for further calculations) you may want to look at the excellent stargazer package.

df <- data.frame(mtcars)
cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
stargazer(
    df[, cols], type = "text", 
    summary.stat = c("min", "p25", "median", "p75", "max", "median", "sd")
)

================================================================
Statistic  Min   Pctl(25) Median Pctl(75)  Max   Median St. Dev.
----------------------------------------------------------------
mpg       10.400  15.430  19.200  22.800  33.900 19.200  6.027
cyl         4       4       6       8       8      6     1.786
vs          0       0       0       1       1      0     0.504
am          0       0       0       1       1      0     0.499
gear        3       3       4       4       5      4     0.738
carb        1       2       2       4       8      2     1.615
----------------------------------------------------------------

You can change type to 'latex' and 'html' as well and save it to file with specifying the file giving 'out' argument.

1
votes

You can achieve the same result using data.table as well. You might consider using it if your table is big.

dt <- data.table(mtcars)

cols <- c('mpg', 'cyl', 'vs', 'am', 'gear', 'carb')
functions <- c('min', 'q25', 'median', 'q75', 'max', 'mean', 'sd')

dt.sum <- dt[
    , 
    lapply(
        .SD, 
        function(x) list(
                min(x), quantile(x, 0.25), median(x), 
                quantile(x, 0.75), max(x), mean(x), sd(x)
        )
    ),
    .SDcols = cols
]

dt.sum
     mpg   cyl     vs     am   gear  carb
1:  10.4     4      0      0      3     1
2: 15.43     4      0      0      3     2
3:  19.2     6      0      0      4     2
4:  22.8     8      1      1      4     4
5:  33.9     8      1      1      5     8
6: 20.09 6.188 0.4375 0.4062  3.688 2.812
7: 6.027 1.786  0.504  0.499 0.7378 1.615

# transpose and provide meaningful names
dt.sum.t <- as.data.table(t(sum))[]
setnames(dt.sum.t, names(dt.sum.t), functions)
dt.sum.t[, var := cols]
setcolorder(dt.sum.t, c("var", functions))

dt.sum.t
    var  min   q25 median  q75  max   mean     sd
1:  mpg 10.4 15.43   19.2 22.8 33.9  20.09  6.027
2:  cyl    4     4      6    8    8  6.188  1.786
3:   vs    0     0      0    1    1 0.4375  0.504
4:   am    0     0      0    1    1 0.4062  0.499
5: gear    3     3      4    4    5  3.688 0.7378
6: carb    1     2      2    4    8  2.812  1.615
1
votes

There's a "new" package called skimr that has a function called skim() that gives wonderful output describing individual variables in a data.fame/tibble.

Try:

skimr::skim(mtcars)

and you'll get: enter image description here

it is customizable and works well with pipes etc. see ?skimr::skim() and vignette("Using_skimr", package = "skimr")