5
votes

I'd like a box plot that looks just like the one below, but instead of the defaults I'd like it to (1) show 95% confidence intervals and (2) leave out the outliers.

The 95% confidence intervals could mean (i) extending the boxes and removing the whiskers, or (ii) having just a mean and whiskers, and removing the boxes. If people have other ideas for presenting 95% confidence intervals in a plot like this, I'm open to suggestions. The final goal is to show the mean and confidence intervals for data across multiple categories on the same plot.

library(ggplot2)

set.seed(1234)
df <- data.frame(cond = factor(rep(c("A","B"), each=200)),
                 rating = c(rnorm(200), rnorm(200, mean=.8)))
ggplot(df, aes(x=cond, y=rating, fill=cond)) + geom_boxplot() +
    guides(fill="none") + coord_flip()

[Image: horizontal box plots of rating for cond A and cond B]

Image and code source: http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/

2
This is not a good idea, as it is not a box plot anymore and might very well be confusing. You can easily have the 1st and 3rd quartiles fall outside the confidence interval (which is a function of the sample size), so the whiskers would be covered up by the box! Why not just use geom_crossbar, geom_errorbar or geom_linerange? That is basically the answer anyway: just build your own boxplot elements using the different geoms. - Andy W
I concur with @AndyW that one should not change boxplot fundamentals. A combination of geom_errorbar and geom_violin might be suitable for your purposes. - CMichael
@Jaap Somehow I missed this! Thanks for pinging. - Dr. Beeblebrox

2 Answers

8
votes

I've used the following to show a 95% interval. Based on what I've read it's not an uncommon use of box and whisker, but it's not the default, so you do need to make it clear what you're showing in the graph.

quantiles_95 <- function(x) {
  r <- quantile(x, probs=c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

ggplot(df, aes(x=cond, y=rating, fill=cond)) +
    guides(fill="none") +
    coord_flip() +
    stat_summary(fun.data = quantiles_95, geom="boxplot")

Instead of using geom_boxplot, use stat_summary with a custom function that specifies the limits you want to use:

  • "ymin" is the lower limit of the lower whisker
  • "lower" is the lower limit of the lower box
  • "middle" is the middle of the box (typically the median)
  • "upper" is the upper limit of the upper box
  • "ymax" is the upper limit of the upper whisker.

In the provided function (quantiles_95), the built-in quantile function is used with a custom probs argument. As given, the whiskers will span 90% of your data: from the 5th percentile to the 95th percentile. The box will span the middle two quartiles, as usual, from 25% to 75%.
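To see what the summary function hands to the boxplot geom, here is a small check on a toy vector (the integers 0 to 100, chosen only for illustration). It uses base R only and repeats quantiles_95 so it runs on its own:

```r
# Same summary function as above, repeated so this snippet is self-contained.
quantiles_95 <- function(x) {
  r <- quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

# Toy input (an assumption for illustration, not the question's data).
toy <- 0:100
q <- quantiles_95(toy)
print(q)

# Share of the values lying between the two whisker ends (roughly 90%).
whisker_share <- mean(toy >= q[["ymin"]] & toy <= q[["ymax"]])
print(whisker_share)
```

The five names are the ones the boxplot geom expects, so the same pattern should work for any other set of probs you pass to quantile.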

You can always change the custom function to choose different quantiles (or even to not use quantiles), but you need to be very careful with this. As pointed out in a comment, there is a certain expectation when one sees a box and whisker plot. If you're using the same shape plot to convey different information, you're likely to confuse people.

If you want to get rid of the whiskers, make the "ymin" equal to "lower" and the "ymax" equal to "upper". If you want to have all whiskers and no box, set "upper" and "lower" both equal to "middle" (or just use geom_errorbar).
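If what you are after is the mean with a 95% confidence interval for the mean (rather than quantiles of the data), a rough sketch along the same lines could look like this. mean_ci_95 is a hypothetical helper, not something from ggplot2, and it assumes a t-based interval is appropriate for your data:

```r
library(ggplot2)

# Hypothetical helper: mean and a t-based 95% confidence interval for the mean.
# Returns the y / ymin / ymax columns that stat_summary(fun.data = ...) expects.
mean_ci_95 <- function(x) {
  n <- length(x)
  m <- mean(x)
  half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  data.frame(y = m, ymin = m - half_width, ymax = m + half_width)
}

# Same simulated data as in the question.
set.seed(1234)
df <- data.frame(cond = factor(rep(c("A", "B"), each = 200)),
                 rating = c(rnorm(200), rnorm(200, mean = 0.8)))

# One point (the mean) and one line (the interval) per category.
p <- ggplot(df, aes(x = cond, y = rating, colour = cond)) +
  stat_summary(fun.data = mean_ci_95, geom = "pointrange",
               show.legend = FALSE) +
  coord_flip()
print(p)
```

Because this is no longer drawn as a box, readers are less likely to mistake it for a standard box and whisker plot.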

7
votes

You can hide the outliers by setting the size to 0:

ggplot(df, aes(x=cond, y=rating, fill=cond)) + 
  geom_boxplot(outlier.size = 0) + 
  guides(fill="none") + coord_flip()
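Depending on your ggplot2 version, points of size 0 may still show up as tiny dots. An alternative, sketched here with the question's data, is to set outlier.shape = NA, which geom_boxplot documents as the way to hide outliers:

```r
library(ggplot2)

# Same simulated data as in the question.
set.seed(1234)
df <- data.frame(cond = factor(rep(c("A", "B"), each = 200)),
                 rating = c(rnorm(200), rnorm(200, mean = 0.8)))

# outlier.shape = NA suppresses drawing of the outlier points;
# the boxes and whiskers are computed exactly as before.
p <- ggplot(df, aes(x = cond, y = rating, fill = cond)) +
  geom_boxplot(outlier.shape = NA) +
  guides(fill = "none") +
  coord_flip()
print(p)
```

Note that this only hides the points. The axis range is still computed from all of the data, so the plot may keep some empty space where the outliers were.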

You can add the mean to the plot with the stat_summary function:

ggplot(df, aes(x=cond, y=rating, fill=cond)) + 
  geom_boxplot(outlier.size = 0) + 
  # fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
  stat_summary(fun=mean, geom="point", shape=23, size=4, fill="white") +
  guides(fill="none") + 
  coord_flip()