2 votes

I frequently make boxplots where some of the categories are quite small and others have plentiful data, superimposed with jittered raw datapoints. I'm looking for a reliable way to hide the box and whiskers for categories that are very small (N<5). The goal is that those little categories would show just the raw data using a geom_point() layer, but the categories where it makes sense would get the box-and-whisker treatment. The thing that seemed obvious to me, mapping alpha in the geom_boxplot() layer to a factor variable based on N, does not work because alpha only controls the fill and maybe the outliers in geom_boxplot, not the box and whiskers.

I have found a kludgey solution in the past that worked as long as I was willing to waste the color parameter on this problem. However, often I want to actually use color for something else, and mapping it twice leads to gnarly output. Another kludgey solution that occurs to me is using a data subset from which small categories have been deleted - the problem with this plan is that it won't correctly handle situations when these categories are subject to position_dodge() (as the dodge will "see" too few categories).

Minimal example below.

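library(dplyr)
library(ggplot2)

df <- data.frame(group=factor(sample(c("A","B"), size=110, replace=TRUE)),
                 sex=factor(c(rep("M",50), rep("F", 50), rep("NB", 10))),
                 height=c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6)))

# Subset with the small category removed by hand (boxplot layer only)
dfsub <- filter(df, !(sex=="NB" & group=="A"))

# Boxes come from dfsub, points from the full df; in group A the boxes
# are dodged across two sexes while the points are dodged across three
ggplot(df, aes(x=group, y=height, colour=sex)) +
  geom_boxplot(data=dfsub) +
  geom_point(position=position_jitterdodge(jitter.width=0.2))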

2 Answers

1 vote

Okay, I don't think this way is necessarily any better than your current options, but... You could split your df into dfs for the boxplot and the scatterplot, and modify the values of the data you want removed from the boxplot to be way out of range (e.g., 1000 here). Then plot both, and finally use coord_cartesian to zoom in on the relevant section.

To create the df_box, we group by group and sex, and change the values of groups with < 5 datapoints to 1000 (so that we don't have to hard-code in which values to change).

library(dplyr)
library(ggplot2)

df <- data.frame(group=factor(sample(c("A","B"), size=110, replace=TRUE)),
                 sex=factor(c(rep("M",50), rep("F", 50), rep("NB", 10))),
                 height=c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6)))

# Boxplot data only: any group/sex combination with < 5 rows gets its
# heights replaced by 1000, far outside the plotted range
df_box <- df %>%
    group_by(group, sex) %>%
    mutate(height = if (n() < 5) 1000 else height) %>%
    ungroup()

# Points use the untouched df; coord_cartesian zooms without dropping data,
# so the out-of-range boxes still take part in the dodge but are not visible
ggplot(df, aes(x=group, y=height, colour=sex)) +
    geom_boxplot(data=df_box) +
    geom_point(position=position_jitterdodge(jitter.width=0.2)) +
    coord_cartesian(ylim=c(50,90))
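
The limits c(50, 90) above are hard-coded for this example. One possible variation, only a sketch, is to derive them from the real data so that the zoom follows whatever you plot. The name y_limits, the 5% padding and the set.seed() call are my own additions for illustration, and this assumes the sentinel value (1000) is well outside the real range.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # added only so this sketch is reproducible
df <- data.frame(group = factor(sample(c("A", "B"), size = 110, replace = TRUE)),
                 sex = factor(c(rep("M", 50), rep("F", 50), rep("NB", 10))),
                 height = c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6)))

# Boxplot data: small group/sex combinations are pushed to the sentinel 1000
df_box <- df %>%
  group_by(group, sex) %>%
  mutate(height = if (n() < 5) 1000 else height) %>%
  ungroup()

# Limits taken from the real data, padded by 5% of the range on each side
y_range <- range(df$height)
y_limits <- y_range + c(-1, 1) * 0.05 * diff(y_range)

p <- ggplot(df, aes(x = group, y = height, colour = sex)) +
  geom_boxplot(data = df_box) +
  geom_point(position = position_jitterdodge(jitter.width = 0.2)) +
  coord_cartesian(ylim = y_limits)
# print(p) draws the plot
```

Since whiskers and outliers never extend beyond the observed data, limits built from range(df$height) should cover every real box as well as every point.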


1 vote

I made a second column for your height data where values from small sample size groups are replaced with NA. When plotting the data, use the original height column as the y aesthetic for points, and the new column with NA values for small groups as the y aesthetic for boxplots.

To make boxplots and points line up correctly, use geom_boxplot(position = position_dodge(preserve = "single")) to tell ggplot to maintain a constant width for boxplots even with missing data.

library(tidyverse)

df <- data.frame(group = factor(sample(c("A", "B"), size = 110, replace = TRUE)),
                 sex = factor(c(rep("M", 50), rep("F", 50), rep("NB", 10))),
                 height = c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6)))

#calculate sample sizes for each group/sex combination (stored in column n)
sample_sizes <- df %>%
  count(group, sex)

#join sample sizes to df, naming the keys explicitly
df <- left_join(df, sample_sizes, by = c("group", "sex")) %>%
  #make second height column to use for boxplots: NA values if n is too small
  mutate(boxplot_height = ifelse(n < 5, NA_real_, height))


ggplot(df, aes(x = group, colour = sex)) +
  #use height column that has groups with n < 5 coded as NA to plot boxplots
  geom_boxplot(aes(y = boxplot_height),
               #preserve = "single" maintains constant width of boxes
               position = position_dodge(preserve = "single"),
               #drop the NA rows without a warning
               na.rm = TRUE) +
  geom_point(aes(y = height), #use all height data as y variable for points
             position = position_jitterdodge(jitter.width = 0.2))
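
If you do this often, the NA-masking step could be wrapped in a small helper so the threshold is not hard-coded. This is only a sketch: mask_small_groups() and its min_n argument are hypothetical names of mine, not part of any package, and the toy data frame is built deterministically (NB in group A has just 2 rows) so that the effect is easy to check.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: adds boxplot_height, which is NA for every
# group/sex combination that has fewer than min_n rows
mask_small_groups <- function(data, min_n = 5) {
  data %>%
    group_by(group, sex) %>%
    mutate(boxplot_height = if (n() < min_n) NA_real_ else height) %>%
    ungroup()
}

# Toy data: group A has 6 M, 6 F, 2 NB; group B has 6 of each
set.seed(42)
toy <- data.frame(
  group = factor(rep(c("A", "B"), times = c(14, 18))),
  sex = factor(c(rep("M", 6), rep("F", 6), rep("NB", 2),
                 rep("M", 6), rep("F", 6), rep("NB", 6))),
  height = rnorm(32, 65, 6)
)

masked <- mask_small_groups(toy, min_n = 5)

p <- ggplot(masked, aes(x = group, colour = sex)) +
  # boxes are drawn only where boxplot_height is not NA
  geom_boxplot(aes(y = boxplot_height),
               position = position_dodge(preserve = "single"),
               na.rm = TRUE) +
  # points always use the original heights
  geom_point(aes(y = height),
             position = position_jitterdodge(jitter.width = 0.2))
# print(p) draws the plot
```

Here only the two NB rows in group A end up with NA in boxplot_height, so that combination shows points only while every other combination gets a box.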
