Creating table of variables with specific summary statistics

Question

I am trying to make a table of all my numerical variables (i.e. feature) in the following format:

Feature | Count | % Missing | Cardinality | Min. | 1st Quartile | Mean | Median | 3rd Quartile | Max. | Std. Dev. |

--------|-------|-----------|-------------|------|--------------|------|--------|--------------|------|-----------| | | | | | | | | | | |

So each row signifies a specific numeric variable and each column the statistics shown above (Count, % Missing, Cardinality, Min., 1st Quartile, Mean, Median, 3rd Quartile, Max. Std. Dev.)

Say my dataset is called Mashable and my numerical variables are called X, Y and Z. How would I create this table?

Thanks in advance!

camille camille · Accepted Answer · 2018-05-20T16:57:45

If you're using dplyr already, you can make use of long shaped data and grouping, and treat all the functions you need as summarizations. That lets you scale easily, so it's the same workflow for 3 variables as it is for 25 or 100. It also makes it relatively quick to apply whatever functions you want.

I made dummy data with x, y, and z, then bound onto it a couple rows of NAs just to show the missing value count. Gather it to long data, group by the variable, then use whatever summary functions you want. I started out the first several you named. This gives you the format you asked for.

library(tidyverse)

tibble(
  x = rnorm(100, mean = 1, sd = 1),
  y = rnorm(100, mean = 10, sd = 1),
  z = rexp(100, rate = 0.01)
) %>%
  bind_rows(tibble(x = c(NA, NA), y = c(NA, NA), z = c(NA, NA))) %>%
  gather(key = variable, value = value) %>%
  group_by(variable) %>%
  summarise(
    count = n(),
    missing = sum(is.na(value)),
    share_missing = missing / count,
    mean = mean(value, na.rm = T),
    sd = sd(value, na.rm = T),
    q1 = quantile(value, 0.25, na.rm = T)
  )
#> # A tibble: 3 x 7
#>   variable count missing share_missing    mean     sd     q1
#>   <chr>    <int>   <int>         <dbl>   <dbl>  <dbl>  <dbl>
#> 1 x          102       2        0.0196   0.997  1.08   0.246
#> 2 y          102       2        0.0196   9.81   0.962  9.10 
#> 3 z          102       2        0.0196 106.    90.6   39.9

Created on 2018-05-20 by the reprex package (v0.2.0).

Creating table of variables with specific summary statistics

3 Answers