1
votes

I am trying to make a table of all my numerical variables (i.e. feature) in the following format:

Feature | Count | % Missing | Cardinality | Min. | 1st Quartile | Mean | Median | 3rd Quartile | Max. | Std. Dev. |

--------|-------|-----------|-------------|------|--------------|------|--------|--------------|------|-----------| | | | | | | | | | | |

So each row signifies a specific numeric variable and each column the statistics shown above (Count, % Missing, Cardinality, Min., 1st Quartile, Mean, Median, 3rd Quartile, Max. Std. Dev.)

Say my dataset is called Mashable and my numerical variables are called X, Y and Z. How would I create this table?

Thanks in advance!

3

3 Answers

3
votes

If you're using dplyr already, you can make use of long shaped data and grouping, and treat all the functions you need as summarizations. That lets you scale easily, so it's the same workflow for 3 variables as it is for 25 or 100. It also makes it relatively quick to apply whatever functions you want.

I made dummy data with x, y, and z, then bound onto it a couple rows of NAs just to show the missing value count. Gather it to long data, group by the variable, then use whatever summary functions you want. I started out the first several you named. This gives you the format you asked for.

library(tidyverse)

tibble(
  x = rnorm(100, mean = 1, sd = 1),
  y = rnorm(100, mean = 10, sd = 1),
  z = rexp(100, rate = 0.01)
) %>%
  bind_rows(tibble(x = c(NA, NA), y = c(NA, NA), z = c(NA, NA))) %>%
  gather(key = variable, value = value) %>%
  group_by(variable) %>%
  summarise(
    count = n(),
    missing = sum(is.na(value)),
    share_missing = missing / count,
    mean = mean(value, na.rm = T),
    sd = sd(value, na.rm = T),
    q1 = quantile(value, 0.25, na.rm = T)
  )
#> # A tibble: 3 x 7
#>   variable count missing share_missing    mean     sd     q1
#>   <chr>    <int>   <int>         <dbl>   <dbl>  <dbl>  <dbl>
#> 1 x          102       2        0.0196   0.997  1.08   0.246
#> 2 y          102       2        0.0196   9.81   0.962  9.10 
#> 3 z          102       2        0.0196 106.    90.6   39.9

Created on 2018-05-20 by the reprex package (v0.2.0).

1
votes

You can use dplyr and tidyr packages transform your data. Let me take an example to demonstrate transformation for 3 statistics(e.g. mean,median,sd). You can add other statistics like count, % missing etc similar way.

# Sample data
df <- data.frame(X = 1:10, Y = 11:20, Z = 101:110)

library(dplyr)
library(tidyr)

df %>% summarize_all(funs(mean,median,sd)) %>%  #Statistics for mean, median,sd
  gather(key, value) %>%    #Convert to long format
  separate(key, c("feature", "stat")) %>%  #separate out feature from function
  spread(stat, value)

#    feature  mean median      sd
# 1        X   5.5    5.5 3.02765
# 2        Y  15.5   15.5 3.02765
# 3        Z 105.5  105.5 3.02765
0
votes

If X, Y and Z are the columns of your dataset, then you could just use the apply function to each column in your dataset in R and this will print each of your required statistics e.g.:

apply(dat,2,function(x) c(length(which(is.na(x)==TRUE)),summary(x)))

I mean in general you would specify the data-frame in the first argument. Then select the rows/columns or list (e.g. 2 is column, 1 is row for a matrix/df) depending on the type of structure you are using the function on. And finally a function to apply to each of those rows/columns of list as specified in the prior argument. In this case we used the c function to output both the length of NA and a summary of the data (min, max, quartiles). By extension we can specify any argument here, so in general:

apply("Insert your data-frame","What part of the data-frame",function(x) c(a(x),b(x),...))

It would be preferable to read the R help file for any additional requirements!

Hope it helps; Cheers.