0
votes

Box plots are handy for summarizing continuous data, but box plots for rare subgroups (n < 10) are not always helpful. Is it possible, in a grouped box plot, to replace the box with the raw data points for those groups that are rare?

Example:

library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot()

This produces a box plot of hwy (continuous) for each class (car type). However, the class frequencies show that there are only 5 2seaters and 11 minivans. Instead of box plots for 2seaters and minivans, I'd like to see the raw data (points, potentially jittered), while keeping the box plots for the other groups that meet an arbitrarily set minimum sample size (e.g. n = 20).

table(mpg$class)

   2seater    compact    midsize    minivan     pickup subcompact        suv       
         5         47         41         11         33         35         62  

Is that even possible?

Cheers, Luc

2 Answers

1
votes

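Here is how this can be done. Note that this draws the jittered points for every class and adds box plots only for classes with more than 20 observations. You can change the threshold from 20 to whatever you like.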

# loading the needed libraries
library(tidyverse)

# adding a new column containing count information
(mpg <- mpg %>%
    dplyr::group_by(.data = ., class) %>%
    dplyr::mutate(.data = ., n = dplyr::n()))
#> # A tibble: 234 x 12
#> # Groups:   class [7]
#>    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
#>  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
#>  3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
#>  4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
#>  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
#>  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~
#>  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     comp~
#>  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     comp~
#>  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     comp~
#> 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     comp~
#> # ... with 224 more rows, and 1 more variable: n <int>

# plot
ggplot(data = mpg, mapping = aes(x = class, y = hwy, color = class)) +
  # plotting jittered points
  geom_jitter(size = 3, alpha = 0.5, width = 0.15) +
  # adding boxplots only for classes with more than a certain count value
  geom_boxplot(data = dplyr::filter(.data = mpg, n > 20), alpha = 0.5)

Created on 2018-08-23 by the reprex package (v0.2.0.9000).
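
One side effect of overlaying jittered points on box plots is that geom_boxplot() also draws its own outlier points, so outliers in the larger classes appear twice. Below is a minimal sketch of a variant, assuming that is unwanted: it hides the box plot outliers with outlier.shape = NA and uses dplyr::add_count() so the data is not left grouped. The names mpg_n and min_n are made up for this illustration, and >= follows the question's "minimum sample size" wording rather than the > used above.

```r
library(ggplot2)
library(dplyr)

min_n <- 20  # illustrative threshold, taken from the question

# add_count() appends a per-class count column `n`;
# the result is stored under a new name so the original mpg stays untouched
mpg_n <- add_count(ggplot2::mpg, class)

p <- ggplot(mpg_n, aes(x = class, y = hwy, color = class)) +
  # jittered raw points for every class
  geom_jitter(size = 3, alpha = 0.5, width = 0.15) +
  # box plots only for classes with at least min_n rows;
  # outliers are hidden because the jitter layer already shows them
  geom_boxplot(data = filter(mpg_n, n >= min_n),
               alpha = 0.5, outlier.shape = NA)
p
```

With this threshold, 2seater and minivan should be the only classes left without a box.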

1
votes

This solution plots the raw points only for the small classes (as requested) and box plots only for the larger classes:

library(ggplot2)
library(dplyr)

min_n <- 20

mpg %>% 
  group_by(class) %>% 
  mutate(class_count = n()) %>% 
  ggplot(mapping = aes(class, hwy, color = class)) +
  geom_jitter(data = . %>% filter(class_count < min_n)) +
  geom_boxplot(data = . %>% filter(class_count >= min_n))

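
The `data = . %>% filter(...)` idiom works because a ggplot2 layer accepts a function as its data argument and applies it to the plot's default data. The sketch below spells that out with named functions, which may be easier to read. The helper names small_classes and large_classes, and mpg_counts, are hypothetical. It also sets height = 0 in geom_jitter(), on the assumption that you want the hwy values shown exactly, since by default geom_jitter() adds noise in both directions.

```r
library(ggplot2)
library(dplyr)

min_n <- 20

# per-class counts stored in a column named class_count
mpg_counts <- ggplot2::mpg %>%
  add_count(class, name = "class_count")

# hypothetical helpers: each receives the plot data and returns a subset
small_classes <- function(d) filter(d, class_count < min_n)
large_classes <- function(d) filter(d, class_count >= min_n)

p <- ggplot(mpg_counts, aes(class, hwy, color = class)) +
  # raw points for rare classes, jittered horizontally only
  geom_jitter(data = small_classes, width = 0.15, height = 0) +
  # box plots for classes that reach the minimum sample size
  geom_boxplot(data = large_classes)
p
```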

You might also want to look at geom_violin, which shows more of the data distribution and which I find more informative than a box plot (and you can have both :) ):

mpg %>% 
  group_by(class) %>% 
  mutate(class_count = n()) %>% 
  ggplot(mapping = aes(class, hwy, color = class)) +
  geom_jitter(data = . %>% filter(class_count < min_n)) +
  geom_violin(data = . %>% filter(class_count >= min_n), scale = "count") +
  geom_boxplot(data = . %>% filter(class_count >= min_n), width = 0.1)
