19
votes

I would like to plot boxplots without outliers with ggplot, giving focus on the boxes and whiskers only

For example:

p1 <- ggplot(diamonds, aes(x=cut, y=price, fill=cut))
p1 + geom_boxplot() + facet_wrap(~clarity, scales="free")

gives facetted boxplots with outliers

enter image description here

I can suppress outliers with outlier.size=NA:

p1 <- ggplot(diamonds, aes(x=cut, y=price, fill=cut))
p1 + geom_boxplot(outlier.size=NA) + facet_wrap(~clarity, scales="free")

which gives

enter image description here

Here, the y-axis scale is the same as in the original plot, just the outliers don't show up. How can I now modify the scale to "zoom in" on each panel according to the whisker ends?

I can reset ylim like this

ylim1 = boxplot.stats(diamonds$price)$stats[c(1, 5)]

and then replot

p1 + geom_boxplot(outlier.size=NA) 
   + facet_wrap(~clarity, scales="free") 
   +  coord_cartesian(ylim = ylim1*1.05)

but this doesn't work on the facets:

enter image description here

Is there a way to "facet_wrap" the boxplots.stats function?

Edit:

I've tried to calculate the boxplot statistics dynamically, but this doesn't seem to work.

give.stats <- function(x){return(boxplot.stats(x)$stats[c(1,5)])}

p1 + geom_boxplot(outlier.size=NA) + 
  facet_wrap(~clarity, scales="free") + 
  coord_cartesian(ylim = give.stats)

> Error in min(x, na.rm = na.rm) : invalid 'type' (list) of argument

Any more ideas would be much appreciated.

4

4 Answers

16
votes

It can be done with stat_summary and custom statistic calculation function:

calc_boxplot_stat <- function(x) {
  coef <- 1.5
  n <- sum(!is.na(x))
  # calculate quantiles
  stats <- quantile(x, probs = c(0.0, 0.25, 0.5, 0.75, 1.0))
  names(stats) <- c("ymin", "lower", "middle", "upper", "ymax")
  iqr <- diff(stats[c(2, 4)])
  # set whiskers
  outliers <- x < (stats[2] - coef * iqr) | x > (stats[4] + coef * iqr)
  if (any(outliers)) {
    stats[c(1, 5)] <- range(c(stats[2:4], x[!outliers]), na.rm = TRUE)
  }
  return(stats)
}

ggplot(diamonds, aes(x=cut, y=price, fill=cut)) + 
    stat_summary(fun.data = calc_boxplot_stat, geom="boxplot") + 
    facet_wrap(~clarity, scales="free")

output figure

The stats calculation function is generic, thus no need for data manipulation before plotting.

It is also possible to set whiskers to 10% and 90% :

calc_stat <- function(x) {
  coef <- 1.5
  n <- sum(!is.na(x))
  # calculate quantiles
  stats <- quantile(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
  names(stats) <- c("ymin", "lower", "middle", "upper", "ymax")
  return(stats)
}

ggplot(diamonds, aes(x=cut, y=price, fill=cut)) + 
    stat_summary(fun.data = calc_stat, geom="boxplot") + 
    facet_wrap(~clarity, scales="free")

Output figure with 10% and 90% whiskers

7
votes

Through outlier.size=NA you make the outliers disappear, this is not an option to ignore the outliers plotting the boxplots. So, the plots are generated considering the (invisible) outliers. There seems to be no option for what you want. In order to make the boxplots as you need them I would calculate the quantiles myself and generate the boxplots based on these quantiles, like in the following example:

stat<-tapply(diamonds$price,list(diamonds$cut,diamonds$clarity),function(x) boxplot.stats(x))
stats<-unlist(tapply(diamonds$price,list(diamonds$cut,diamonds$clarity),function(x) boxplot.stats(x)$stats))

df<-data.frame(
  cut=rep(rep(unlist(dimnames(stat)[1]),each=5),length(unlist(dimnames(stat)[2]))),
  clarity=rep(unlist(dimnames(stat)[2]),each=25),
  price=unlist(tapply(diamonds$price,list(diamonds$cut,diamonds$clarity),function(x) boxplot.stats(x)$stats)))

ggplot(df,aes(x=cut,y=price,fill=cut))+geom_boxplot()+facet_wrap(~clarity,scales="free")

Which gives (note that the orders in the plot are different now):

enter image description here

5
votes

Ok, I figured out a more easy way to do this by commenting out some lines in the original ggplot boxplot function and calling the modified function.

I am not a programmer, no idea if this is a good or robust thing to do, but it seems to work fine for now.

This is the modified function I am using:

#modified version of geom_boxplot

require(ggplot2)
geom_boxplot_noOutliers <- function (mapping = NULL, data = NULL, stat = "boxplot",
                          position = "dodge", outlier.colour = NULL,
                          outlier.shape = NULL, outlier.size = NULL,
                          notch = FALSE, notchwidth = .5, varwidth = FALSE,
                          ...) {

  #outlier_defaults <- ggplot2:::Geom$find('point')$default_aes()

  #outlier.colour   <- outlier.colour %||% outlier_defaults$colour
  #outlier.shape    <- outlier.shape  %||% outlier_defaults$shape
  #outlier.size     <- outlier.size   %||% outlier_defaults$size

  GeomBoxplot_noOutliers$new(mapping = mapping, data = data, stat = stat,
                  position = position, outlier.colour = outlier.colour,
                  outlier.shape = outlier.shape, outlier.size = outlier.size, notch = notch,
                  notchwidth = notchwidth, varwidth = varwidth, ...)
}

GeomBoxplot_noOutliers <- proto(ggplot2:::Geom, {
  objname <- "boxplot_noOutliers"

  reparameterise <- function(., df, params) {
    df$width <- df$width %||%
      params$width %||% (resolution(df$x, FALSE) * 0.9)

  # if (!is.null(df$outliers)) {
  #    suppressWarnings({
  #      out_min <- vapply(df$outliers, min, numeric(1))
  #      out_max <- vapply(df$outliers, max, numeric(1))
  #    })
  #    
  #    df$ymin_final <- pmin(out_min, df$ymin)
  #    df$ymax_final <- pmax(out_max, df$ymax)
  #   }

    # if `varwidth` not requested or not available, don't use it
    if (is.null(params) || is.null(params$varwidth) || !params$varwidth || is.null(df$relvarwidth)) {
      df$xmin <- df$x - df$width / 2
      df$xmax <- df$x + df$width / 2
    } else {
      # make `relvarwidth` relative to the size of the largest group
      df$relvarwidth <- df$relvarwidth / max(df$relvarwidth)
      df$xmin <- df$x - df$relvarwidth * df$width / 2
      df$xmax <- df$x + df$relvarwidth * df$width / 2
    }
    df$width <- NULL
    if (!is.null(df$relvarwidth)) df$relvarwidth <- NULL

    df
  }

  draw <- function(., data, ..., fatten = 2, outlier.colour = NULL, outlier.shape = NULL, outlier.size = 2,
                   notch = FALSE, notchwidth = .5, varwidth = FALSE) {
    common <- data.frame(
      colour = data$colour,
      size = data$size,
      linetype = data$linetype,
      fill = alpha(data$fill, data$alpha),
      group = data$group,
      stringsAsFactors = FALSE
    )

    whiskers <- data.frame(
      x = data$x,
      xend = data$x,
      y = c(data$upper, data$lower),
      yend = c(data$ymax, data$ymin),
      alpha = NA,
      common)

    box <- data.frame(
      xmin = data$xmin,
      xmax = data$xmax,
      ymin = data$lower,
      y = data$middle,
      ymax = data$upper,
      ynotchlower = ifelse(notch, data$notchlower, NA),
      ynotchupper = ifelse(notch, data$notchupper, NA),
      notchwidth = notchwidth,
      alpha = data$alpha,
      common)

  #  if (!is.null(data$outliers) && length(data$outliers[[1]] >= 1)) {
  #    outliers <- data.frame(
  #      y = data$outliers[[1]],
  #      x = data$x[1],
  #      colour = outlier.colour %||% data$colour[1],
  #      shape = outlier.shape %||% data$shape[1],
  #      size = outlier.size %||% data$size[1],
  #      fill = NA,
  #      alpha = NA,
  #      stringsAsFactors = FALSE)
  #    outliers_grob <- GeomPoint$draw(outliers, ...)
  #  } else {
      outliers_grob <- NULL
  #  }

    ggname(.$my_name(), grobTree(
      outliers_grob,
      GeomSegment$draw(whiskers, ...),
      GeomCrossbar$draw(box, fatten = fatten, ...)
    ))
  }

  guide_geom <- function(.) "boxplot_noOutliers"
  draw_legend <- function(., data, ...)  {
    data <- aesdefaults(data, .$default_aes(), list(...))
    gp <- with(data, gpar(col=colour, fill=alpha(fill, alpha), lwd=size * .pt, lty = linetype))
    gTree(gp = gp, children = gList(
      linesGrob(0.5, c(0.1, 0.25)),
      linesGrob(0.5, c(0.75, 0.9)),
      rectGrob(height=0.5, width=0.75),
      linesGrob(c(0.125, 0.875), 0.5)
    ))
  }

  default_stat <- function(.) StatBoxplot
  default_pos <- function(.) PositionDodge
  default_aes <- function(.) aes(weight=1, colour="grey20", fill="white", size=0.5, alpha = NA, shape = 16, linetype = "solid")
  required_aes <- c("x", "lower", "upper", "middle", "ymin", "ymax")

})

I saved it as an r file and use source to load it:

library(ggplot2)
library(scales)

#load functions
source("D:/Eigene Dateien/Scripte/R-Scripte/myfunctions/geomBoxplot_noOutliers.r")

Now I can just plot without outliers using geom_boxplot_noOutliers and everything works fine even with facets :-)

p1 <- ggplot(diamonds, aes(x=cut, y=price, fill=cut))
p1 + geom_boxplot_noOutliers() + facet_wrap(~clarity, scales="free")

enter image description here

-3
votes

In your case, I am thinking that limiting the display range could work, since all the outliers are larger than 10000.

p1 + geom_boxplot() + ylim(0,10000)