
I'm trying to determine if there's a general trend for some data. I'm plotting emission data (in tons) against year across different emission types. I feel like I'm approaching the problem right, but maybe I don't fully understand how removing (or hiding?) outliers influences a linear model fit on boxplots. My approach is to plot some boxplots for the data itself, and overlay a linear model for each facet to understand the general trend for each emission type.

With outliers:

q <- ggplot(balt, aes(year, Emissions))
q + 
  geom_boxplot(aes(color=factor(year))) + 
  facet_grid(.~type) + 
  geom_smooth(method='lm')

produces:

[plot: boxplots of Emissions by year, faceted by type, with outliers shown and a linear fit overlaid]

Without outliers:

q + 
  geom_boxplot(aes(color=factor(year)), outlier.shape=NA) + 
  facet_grid(.~type) + 
  geom_smooth(method='lm')

produces:

[plot: the same boxplots with outliers hidden; the y axis still spans the full data range]

Clearly I want to resize the Y axis, so I set the upper Y limit to 80, since the top whisker of the 1999 "NONPOINT" boxplot looks like it comes pretty close to that value in the previous two plots:

q + 
  geom_boxplot(aes(color=factor(year)), outlier.shape=NA) + 
  scale_y_continuous(limits=c(0,80)) +
  facet_grid(.~type) + 
  geom_smooth(method='lm')

produces:

[plot: boxplots with outliers hidden and the y axis limited to 0 to 80]

To demonstrate that it's resizing, I reset the upper Y limit to 60, which the 1999 "NONPOINT" boxplot clearly crosses:

[plot: the same boxplots with the upper y limit at 60; the 1999 "NONPOINT" box is clipped]

The warnings I'm getting for the final plot read as:

Warning messages:
1: Removed 4 rows containing non-finite values (stat_boxplot).
2: Removed 24 rows containing non-finite values (stat_boxplot).
3: Removed 9 rows containing non-finite values (stat_boxplot).
4: Removed 4 rows containing missing values (stat_smooth).
5: Removed 24 rows containing missing values (stat_smooth).
6: Removed 9 rows containing missing values (stat_smooth).
7: Removed 9 rows containing missing values (geom_point).
8: Removed 17 rows containing missing values (geom_point).
9: Removed 17 rows containing missing values (geom_point).
10: Removed 17 rows containing missing values (geom_point).
11: Removed 1 rows containing missing values (geom_point).
12: Removed 1 rows containing missing values (geom_point).
13: Removed 1 rows containing missing values (geom_point).
14: Removed 2 rows containing missing values (geom_point).
15: Removed 20 rows containing missing values (geom_point).
16: Removed 52 rows containing missing values (geom_point).
17: Removed 59 rows containing missing values (geom_point).
18: Removed 41 rows containing missing values (geom_point).
19: Removed 1 rows containing missing values (geom_point).
20: Removed 7 rows containing missing values (geom_point).
21: Removed 10 rows containing missing values (geom_point).
22: Removed 43 rows containing missing values (geom_point).

I'm not quite sure what to make of the non-finite values, but the rest of the warnings look like they're just reporting removed outliers? I could be wrong here, but I wouldn't know how else to interpret them.

Finally, setting the Y upper limit to 20 produces conflicting linear model results:

[plot: boxplots with the upper y limit at 20; the "NONPOINT" trend line now slopes upward]

Where "NONPOINT" had been modeled as a negative slope before, it now appears as a positive slope. Clearly the resizing of the boxplots is influencing the models. Are outlier.shape=NA and scale_y_continuous() actively removing data?

Is my approach horribly flawed? I haven't seen a better method on stack or elsewhere for removing outliers from boxplots.

Good spot, sorry for the typo. Updating plots with lm accordingly... - AI52487963
Without seeing your data I can't say with any certainty, but when you set your limit to y = 60, you are cutting off parts of your data, and thus changing the distribution. It would make sense that if you remove values above 60 you would shift the median (and IQR) of your data downward and the max whisker would also shift downward. - scribbles
@scribbles - that makes sense and would explain the change in behavior of the linear model when cutting the Y limit down that low. Would a safe approach be, after removing outliers, to just resize to the highest whiskers across all facets (plot 3)? - AI52487963
That would be a safe approach and would likely give you results more in line with what you are expecting. It's worth noting that simply deleting outliers is rarely the best approach. Instead, you may want to explore a more robust regression method that is less heavily influenced by outliers (see the sketch after these comments). If this answers your question, please accept the answer below so that others know your problem has been solved. - scribbles
I'm not sure a linear model is appropriate here; your data looks awfully skewed. If you are just exploring the data then don't be a purist, but if you want to do some proper statistics you might want to look at other distributions, possibly Poisson. Or ask on Cross Validated... - RHA
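
As a sketch of the robust-regression suggestion in the comments above (assuming the balt data frame and columns from the question), MASS::rlm() can be passed straight to geom_smooth() so that extreme points are down-weighted rather than deleted:

library(ggplot2)
library(MASS)  # provides rlm(), a robust M-estimator alternative to lm()

# rlm() down-weights extreme observations instead of requiring them to be
# removed, so the trend line is less sensitive to outliers than method = 'lm'.
ggplot(balt, aes(year, Emissions)) +
  geom_boxplot(aes(color = factor(year))) +
  geom_smooth(method = MASS::rlm) +
  facet_grid(. ~ type)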

1 Answer


This was too long for a comment so moving it to an answer. Without seeing your data I can't say with any certainty, but when you set your limit to y = 60, you are cutting off parts of your data, and thus changing the distribution. It would make sense that if you remove values above 60 you would shift the median (and IQR) of your data downward and the max whisker would also shift downward. When you limit y = 20, you are cutting off so much data that you are getting drastically different results because the data from each time point is not being affected the same way due to differences in their respective distributions.

So to answer your question: the lm (which geom_smooth() fits for you inside ggplot2) is being influenced by the artificial limits you are introducing, because scale_y_continuous(limits = ...) drops any rows outside that range before the boxplot and smoothing statistics are computed.
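
To make the distinction concrete, here is a minimal sketch assuming the balt data frame and aesthetics from the question: scale_y_continuous(limits = ...) turns out-of-range values into NA before stat_boxplot and stat_smooth run (which is what the "Removed n rows" warnings report), while coord_cartesian(ylim = ...) only zooms the viewport, so every row still contributes to the boxplot statistics and the fitted line.

library(ggplot2)

# Limits set on the scale DROP data: rows with Emissions outside 0-20 become
# NA and never reach stat_boxplot or stat_smooth, so the lm slope can change.
ggplot(balt, aes(year, Emissions)) +
  geom_boxplot(aes(color = factor(year)), outlier.shape = NA) +
  geom_smooth(method = 'lm') +
  scale_y_continuous(limits = c(0, 20)) +
  facet_grid(. ~ type)

# Limits set on the coordinate system only ZOOM: all rows still feed the
# statistics, so the boxplots and the lm fit are unchanged, just cropped.
ggplot(balt, aes(year, Emissions)) +
  geom_boxplot(aes(color = factor(year)), outlier.shape = NA) +
  geom_smooth(method = 'lm') +
  coord_cartesian(ylim = c(0, 20)) +
  facet_grid(. ~ type)

If your goal is only to see the fit without the extreme points dominating the picture, coord_cartesian() is usually the safer choice because it never alters the underlying statistics.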

There are numerous ways to deal with outliers, but you may wish to deal with them outside of ggplot2. One option is removing values more than +/- 3 SDs from the mean at each time point. Another is transforming your data. You may want to explore the square root transformation, the log transformation, and the inverse/reciprocal transformation (in order of increasing severity); if the log transformation is not sufficient, you can move to the next level of transformation. A Box-Cox analysis profiles a whole family of power transformations so you can choose the best one.
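
Here is a rough sketch of those options outside of ggplot2, assuming balt has year, type, and Emissions as in the question (the 3 SD cut-off, the +1 offset, and the lambda grid are arbitrary illustrations, not recommendations):

library(dplyr)  # for group_by()/filter()
library(MASS)   # for boxcox()

# Option 1: drop observations more than 3 SDs from the group mean within each
# year/type combination (the is.na() check keeps single-row groups intact).
balt_trim <- balt %>%
  group_by(year, type) %>%
  filter(is.na(sd(Emissions)) |
         abs(Emissions - mean(Emissions)) <= 3 * sd(Emissions)) %>%
  ungroup()

# Option 2: transform rather than delete; log1p() tolerates zero emissions.
balt$log_emissions <- log1p(balt$Emissions)

# Option 3: Box-Cox profiles a family of power transformations over a grid of
# lambda values; pick the lambda near the likelihood peak. boxcox() needs a
# strictly positive response, hence the +1 offset here.
bc <- boxcox(lm(I(Emissions + 1) ~ year, data = balt),
             lambda = seq(-2, 2, 0.1))
best_lambda <- bc$x[which.max(bc$y)]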