2
votes

I'm trying to use dplyr::summarize() and dplyr::across() to obtain a tibble with several summary statistics in the rows and the variables in the columns. I was only able to achieve this result by using dplyr::bind_rows(), but I'm wondering if there's a more elegant way to get the same output.

> library(tidyverse)
── Attaching packages ────────────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.1     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1
── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
> 
> bind_rows(min = summarize(starwars, across(where(is.numeric), min, 
+       na.rm = TRUE)), 
+   median = summarize(starwars, across(where(is.numeric), median, 
+       na.rm = TRUE)), 
+   mean = summarize(starwars, across(where(is.numeric), mean, na.rm = TRUE)), 
+   max = summarize(starwars, across(where(is.numeric), max, na.rm = TRUE)), 
+   sd = summarize(starwars, across(where(is.numeric), sd, na.rm = TRUE)), 
+   .id = "statistic")
# A tibble: 5 x 4
  statistic height   mass birth_year
  <chr>      <dbl>  <dbl>      <dbl>
1 min         66     15          8  
2 median     180     79         52  
3 mean       174.    97.3       87.6
4 max        264   1358        896  
5 sd          34.8  169.       155. 

Why can't one do it with summarize directly? Seems more elegant than using a list of functions, as suggested by the colwise vignette. Does this violate the principles of a tidy data frame? (It seems to me that stacking a bunch of data frames beside one another is far less tidy.)

4
Each row in your desired output is defined differently. Row 1 is the min, row 2 is the median, etc. So while this may be convenient to work with, you wouldn't operate on entire columns of that output (e.g., you wouldn't sum over height). So I am not sure that output is considered tidy. The way summarize does it of giving you the wide output is probably more "tidy", but I understand why you want to work with it that way. A lot is philosophy and just understanding what you are trying to do with the data. - Adam
That is an excellent point. Do you think it'd be tidier if I had variables in the rows and statistics on the columns? That'd also be fine for the purposes of presentation. - Lucas De Abreu Maia
You want to store and work with data in a tidy format. For presentation, do what communicates it the best. What you have is probably fine for presentation. I wouldn't stress too much about having your presentation tables "tidy". - Adam

4 Answers

6
votes

Here is a way using purrr to iterate over a list of functions. This is effectively what you were doing with bind_rows(), but with less code.

library(dplyr)
library(purrr)

funs <- lst(min, median, mean, max, sd)

map_dfr(funs,
        ~ summarize(starwars, across(where(is.numeric), .x, na.rm = TRUE)),
        .id = "statistic")

# # A tibble: 5 x 4
#   statistic height   mass birth_year
#   <chr>      <dbl>  <dbl>      <dbl>
# 1 min         66     15          8  
# 2 median     180     79         52  
# 3 mean       174.    97.3       87.6
# 4 max        264   1358        896  
# 5 sd          34.8  169.       155.
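
The question was asked with dplyr 1.0.6. Since dplyr 1.1.0, passing extra arguments through `...` in across() is deprecated, and purrr 1.0.0 marks map_dfr() as superseded in favour of map() followed by list_rbind(). A minimal sketch of the same idea for those newer versions, assuming dplyr >= 1.1.0 and purrr >= 1.0.0 are installed (`result` is only an illustrative name):

```r
library(dplyr)
library(purrr)

# Named list of summary functions; lst() names each element after itself
funs <- lst(min, median, mean, max, sd)

result <- funs %>%
  # For each function f, summarise every numeric column,
  # setting na.rm = TRUE inside an anonymous function instead of via `...`
  map(function(f) {
    summarize(starwars, across(where(is.numeric), function(x) f(x, na.rm = TRUE)))
  }) %>%
  # Stack the one-row tibbles, storing the list names in "statistic"
  list_rbind(names_to = "statistic")

result
```

This should give the same five-row table as above, with one row per statistic.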

2
votes

This results in the output you want, but it's not that fancy.

library(tidyverse)

starwars %>% 
  summarise(across(
    where(is.numeric),
    .fns = list(
       min = min,
       median = median, 
       mean = mean, 
       max = max, 
       sd = sd
    ), 
    na.rm = TRUE, 
    .names = "{.col}_{.fn}")) %>% 
  pivot_longer(cols = everything()) %>% 
  mutate(statistic = str_match(name, pattern = ".+_(.+)")[,2],
         name = str_match(name, pattern = "(.+)_.+")[,2]) %>% 
  pivot_wider(names_from = name, values_from = value)
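
The pivot_longer() / str_match() / pivot_wider() steps can probably be collapsed into one call, because pivot_longer() accepts the special ".value" name in names_to. A sketch under that assumption, where `stat_fns` and `result` are illustrative names that do not appear in the answer above:

```r
library(dplyr)
library(tidyr)

# Wrap each statistic so that na.rm = TRUE is applied inside the function
stat_fns <- lapply(
  list(min = min, median = median, mean = mean, max = max, sd = sd),
  function(f) function(x) f(x, na.rm = TRUE)
)

result <- starwars %>%
  # One wide row with columns such as height_min, height_median, ...
  summarise(across(where(is.numeric), stat_fns, .names = "{.col}_{.fn}")) %>%
  # ".value" keeps the first regex group (the variable) as column names,
  # while the second group (the function name) goes to "statistic"
  pivot_longer(
    everything(),
    names_to = c(".value", "statistic"),
    names_pattern = "(.+)_(.+)"
  )

result
```

The greedy first group means a column such as birth_year_min is split at its last underscore, so variable names that contain underscores are still handled.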

1
votes

You could use gtsummary to summarize the data. Below I subset to the numeric columns (although gtsummary handles many different data types). Then I use the type argument to put the summary statistics on separate rows, and the statistic argument to specify which summaries to display.

library(tidyverse)
library(gtsummary)

starwars[sapply(starwars, is.numeric)] %>% 
    tbl_summary(type = all_continuous() ~ "continuous2",
                statistic = all_continuous() ~ c("{median} ({p25}, {p75})",
                                                 "{min}, {max}",
                                                 "{mean},{sd}"))

1
votes

I would do it this way:

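library(tidyverse)

# stat_funs is not defined in the original answer; assumed here to be
# a named list of the five statistics shown in the output below
stat_funs <- lst(min, median, mean, max, sd)

starwars %>%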
    summarise(across(where(is.numeric), stat_funs,
        na.rm = TRUE, .names = "{.col}__{.fn}")) %>%
    pivot_longer(everything()) %>%
    separate(name, c('v', 'f'), sep = '__') %>%
    pivot_wider(names_from = v, values_from = value)

#   f      height   mass birth_year
#   <chr>   <dbl>  <dbl>      <dbl>
# 1 min      66     15          8  
# 2 median  180     79         52  
# 3 mean    174.    97.3       87.6
# 4 max     264   1358        896  
# 5 sd       34.8  169.       155.