ggplot2: plot raw data for rare subgroups and box plots for common subgroups

Question

Box plots are handy for summarizing continuous data, but box plots for rare subgroups (n < 10) are not always helpful. In a grouped box plot, is it possible to replace the box with the raw data points for those groups that are rare?

Example:

library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot()

This produces a box plot of hwy (continuous) for each class (car type). However, the frequencies for each class show that there are only 5 2seaters and 11 minivans. Instead of box plots for 2seaters and minivans, I'd like to see the raw data (points, potentially jittered), while keeping the box plots for the other groups that meet an arbitrarily chosen minimum sample size (e.g. n = 20).

table(mpg$class)

   2seater    compact    midsize    minivan     pickup subcompact        suv       
         5         47         41         11         33         35         62

Is that even possible?

Cheers, Luc

Indrajeet Patil · Accepted Answer · 2018-08-23T04:56:10

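Here is how this can be done: add a per-class count column, draw jittered points for every class, and add box plots only for the classes whose count exceeds the threshold. You can change the threshold from 20 to whatever you like.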

# loading the needed libraries
library(tidyverse)

# adding a new column containing count information
(mpg <- mpg %>%
    dplyr::group_by(.data = ., class) %>%
    dplyr::mutate(.data = ., n = dplyr::n()))
#> # A tibble: 234 x 12
#> # Groups:   class [7]
#>    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
#>  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
#>  3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
#>  4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
#>  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
#>  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~
#>  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     comp~
#>  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     comp~
#>  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     comp~
#> 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     comp~
#> # ... with 224 more rows, and 1 more variable: n <int>

# plot
ggplot(data = mpg, mapping = aes(x = class, y = hwy, color = class)) +
  # plotting jittered points
  geom_jitter(size = 3, alpha = 0.5, width = 0.15) +
  # adding boxplots only for classes with more than 20 observations
  geom_boxplot(data = dplyr::filter(.data = mpg, n > 20), alpha = 0.5)

Created on 2018-08-23 by the reprex package (v0.2.0.9000).
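
The plot above draws jittered points for every class and overlays box plots on the common ones. If the goal is points only for the rare classes and box plots only for the common ones, as the question describes, a minimal sketch could split the data per layer instead. The names `min_n`, `mpg_counts`, `rare`, `common`, and `p_split` are my own, and treating "n >= 20" as common is an assumption about where the cutoff should fall.

```r
library(dplyr)
library(ggplot2)

# assumed threshold: classes with at least this many rows get a box plot
min_n <- 20

# add_count() appends a per-class count column `n` and leaves the data ungrouped;
# converting class to a factor keeps the axis order the same across both layers
mpg_counts <- ggplot2::mpg %>%
  mutate(class = factor(class)) %>%
  add_count(class)

rare   <- filter(mpg_counts, n < min_n)   # shown as raw points
common <- filter(mpg_counts, n >= min_n)  # shown as box plots

p_split <- ggplot(mapping = aes(x = class, y = hwy, color = class)) +
  # box plots only for the common classes
  geom_boxplot(data = common, alpha = 0.5) +
  # raw points only for the rare classes; height = 0 keeps hwy values exact
  geom_jitter(data = rare, size = 3, alpha = 0.5, width = 0.15, height = 0)

print(p_split)
```

With the mpg data this should leave 2seater and minivan as points and the other five classes as box plots. If you prefer the all-points version from the answer, passing `outlier.shape = NA` to `geom_boxplot()` avoids drawing each outlier twice (once by the box plot and once by the jitter layer).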
