10
votes

So, I have a fairly large dataset (Dropbox: csv file) that I'm trying to plot using geom_boxplot. The following produces what appears to be a reasonable plot:

require(reshape2)
require(ggplot2)
require(scales)
require(grid)
require(gridExtra)

df <- read.csv("\\Downloads\\boxplot.csv", na.strings = "*")
df$year <- factor(df$year, levels = c(2010,2011,2012,2013,2014), labels = c(2010,2011,2012,2013,2014))

d <- ggplot(data = df, aes(x = year, y = value)) +
    geom_boxplot(aes(fill = station)) + 
    facet_grid(station~.) +
    scale_y_continuous(limits = c(0, 15)) + 
    theme(legend.position = "none")
d

However, when you dig a little deeper, problems creep in that freak me out. When I label the boxplot medians with their values, I get the following plot.

df.m <- aggregate(value ~ year + station, data = df, FUN = median)
d <- d + geom_text(data = df.m, aes(x = year, y = value, label = value)) 
d

[Figure: boxplots with medians labelled]

The median lines drawn by geom_boxplot aren't at the medians at all. The labels are plotted at the correct y-axis values, but the middle lines of the boxplots are definitely not at the medians. I've been stumped by this for a few days now.

What is the reason for this? How can this type of display be produced with correct medians? How can this plot be debugged or diagnosed?

1
Your example code has an inconsistency in it. You are calling geom_text against temp.m but the median was computed into turb.m. Could this be the issue? - vpipkt
Ah! Good call on that... I tried to remove my inconsistencies from the original code, but I missed that one. That error would cause the geom_text layer to fail, but even without the geom_text added to the plot, the medians are still drawn incorrectly on the boxplots. - Ryan Pugh
Is the "*" in the value field to be interpreted as NA? - vpipkt
And what data type is year in your data frame? - vpipkt
I've edited the original post to include the full code to generate the faceted plot. As you can see here, where the labels fail to fall on the boxplot horizontal line, there's a problem. I've gone as far as to pare down the dataset to a single station (discharge), using only 2012 data and I still get the exact same boxplot. - Ryan Pugh

1 Answer

11
votes

The cause of this problem is the use of scale_y_continuous. ggplot2 performs operations in the following order:

  1. Scale Transformations
  2. Statistical Computations
  3. Coordinate Transformations

In this case, because scale_y_continuous sets limits on the scale, ggplot2 discards data outside those limits (by default, out-of-range values become NA) before the boxplot statistics are computed. The medians calculated by aggregate and used in geom_text are based on the entire dataset, however. The median lines of the boxplots and the text labels can therefore differ.
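
The effect can be reproduced without ggplot2 at all. The following is only a minimal sketch: the values are made up for illustration (they are not from the question's CSV), and it assumes the default out-of-bounds handling, where values outside the limits are replaced with NA before any statistics are computed.

```r
# Hypothetical values for illustration; not taken from the question's data.
value <- c(2, 4, 6, 8, 20, 30, 40)

# Median over the full data, as aggregate() computes it for the labels.
full_median <- median(value)

# Mimic scale_y_continuous(limits = c(0, 15)): out-of-range values become NA
# and are dropped before the boxplot statistics are calculated.
limits <- c(0, 15)
censored <- ifelse(value < limits[1] | value > limits[2], NA, value)
censored_median <- median(censored, na.rm = TRUE)

full_median      # 8
censored_median  # 5
```

The two medians differ because three of the seven values lie above the upper limit and never reach the second computation.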

The solution is to omit the scale_y_continuous instruction and instead use:

d <- ggplot(data = df, aes(x = year, y = value)) +
    geom_boxplot(aes(fill = station)) +
    facet_grid(station~.) +
    theme(legend.position = "none") +
    coord_cartesian(ylim = c(0, 15))

This lets ggplot2 calculate the boxplot statistics from the entire dataset, while restricting only the visible range of the y-axis.
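
As for diagnosing this kind of problem, one option is to look at the statistics ggplot2 actually computed for a layer, for example with layer_data(), and compare them with your own summary. The sketch below uses simulated data; the data frame demo and its values are assumptions for illustration, not the poster's dataset.

```r
library(ggplot2)

# Simulated, right-skewed data standing in for the real measurements.
set.seed(1)
demo <- data.frame(
  year  = factor(rep(2010:2011, each = 50)),
  value = rexp(100, rate = 0.1)
)

p <- ggplot(demo, aes(x = year, y = value)) + geom_boxplot()

# Zooming keeps every observation; scale limits drop values above 15.
zoomed_plot   <- p + coord_cartesian(ylim = c(0, 15))
censored_plot <- p + scale_y_continuous(limits = c(0, 15))

# Reference medians computed directly from the data, one per year.
true_medians <- aggregate(value ~ year, data = demo, FUN = median)$value

# The "middle" column holds the median line drawn for each box.
zoomed_middle   <- layer_data(zoomed_plot)$middle
censored_middle <- suppressWarnings(layer_data(censored_plot)$middle)

data.frame(year = levels(demo$year), true_medians, zoomed_middle, censored_middle)
```

If the middle column disagrees with your own medians, something upstream of the statistic (such as scale limits) has altered the data. ggplot2 also emits a warning about removed rows in that situation, which is suppressed here only to keep the output short.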