Hide box and whiskers in geom_boxplot() when N is small

Question

I frequently make boxplots where some of the categories are quite small and others have plentiful data, superimposed with jittered raw datapoints. I'm looking for a reliable way to hide the box and whiskers for categories that are very small (N<5). The goal is that those little categories would show just the raw data using a geom_point() layer, but the categories where it makes sense would get the box-and-whisker treatment. The thing that seemed obvious to me, mapping alpha in the geom_boxplot() layer to a factor variable based on N, does not work because alpha only controls the fill and maybe the outliers in geom_boxplot, not the box and whiskers.

I have found a kludgey solution in the past that worked as long as I was willing to waste the color parameter on this problem. However, often I want to actually use color for something else, and mapping it twice leads to gnarly output. Another kludgey solution that occurs to me is using a data subset from which small categories have been deleted - the problem with this plan is that it won't correctly handle situations when these categories are subject to position_dodge() (as the dodge will "see" too few categories).

Minimal example below.

library(dplyr)    # filter()
library(ggplot2)

df <- data.frame(group=factor(sample(c("A","B"), size=110, replace=TRUE)),
                 sex=factor(c(rep("M",50), rep("F", 50), rep("NB", 10))),
                 height=c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6)))

# Subset approach: drop the small category from the boxplot data.
# The dodge then "sees" too few categories in group A.
dfsub <- filter(df, !(sex=="NB" & group=="A"))

ggplot(df, aes(x=group, y=height, colour=sex)) +
  geom_boxplot(data=dfsub) +
  geom_point(position=position_jitterdodge(jitter.width=0.2))

heds1 · Accepted Answer · 2019-10-09T22:13:12

Okay, I don't think this way is necessarily any better than your current options, but... You could split your df into dfs for the boxplot and the scatterplot, and modify the values of the data you want removed from the boxplot to be way out of range (e.g., 1000 here). Then plot both, and finally use coord_cartesian to zoom in on the relevant section.

To create the df_box, we group by group and sex, and change the values of groups with < 5 datapoints to 1000 (so that we don't have to hard-code in which values to change).

library(dplyr)
library(ggplot2)

df <- data.frame(group=factor(sample(c("A","B"), size=110, replace=TRUE)),
                 sex=factor(c(rep("M",50), rep("F", 50), rep("NB", 10))),
                 height=c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6)))

# For each group/sex combination with fewer than 5 rows, replace every
# height with 1000 so its box is drawn far outside the visible range.
df_box <- df %>%
    group_by(group, sex) %>%
    mutate(height = if (n() < 5) 1000 else height) %>%
    ungroup()

# Boxes come from df_box, points from the untouched df;
# coord_cartesian() zooms without dropping the out-of-range boxes.
ggplot(df, aes(x=group, y=height, colour=sex)) +
    geom_boxplot(data=df_box) +
    geom_point(position=position_jitterdodge(jitter.width=0.2)) +
    coord_cartesian(ylim=c(50,90))
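
If you need this in more than one plot, the masking step could be wrapped in a small helper. The sketch below is only an illustration: the function name `mask_small_groups()`, its `min_n` and `sentinel` arguments, and the `toy` data frame are made up here and are not part of the original answer. It assumes the same `group`, `sex` and `height` columns as the example above.

```r
library(dplyr)

# Hypothetical helper (not from the original answer): for every group/sex
# combination with fewer than `min_n` rows, replace height with `sentinel`.
mask_small_groups <- function(data, min_n = 5, sentinel = 1000) {
  data %>%
    group_by(group, sex) %>%
    mutate(height = if (n() < min_n) sentinel else height) %>%
    ungroup()
}

# Small deterministic data set (an assumption, for checking only):
# group A has 6 rows, group B has 2 rows.
toy <- data.frame(
  group  = c(rep("A", 6), rep("B", 2)),
  sex    = "M",
  height = c(60, 62, 64, 66, 68, 70, 61, 63)
)

# See which combinations fall below the threshold.
count(toy, group, sex)

# Group B is masked; group A keeps its original heights.
masked <- mask_small_groups(toy)
masked
```

The result would then be passed as `data` to geom_boxplot(), as with df_box above, with the coord_cartesian() limits kept well below the sentinel value.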

