I'm trying to determine if there's a general trend for some data. I'm plotting emission data (in tons) against year across different emission types. I feel like I'm approaching the problem right, but maybe I don't fully understand how removing (or hiding?) outliers influences a linear model fit on boxplots. My approach is to plot some boxplots for the data itself, and overlay a linear model for each facet to understand the general trend for each emission type.
With outliers:
q <- ggplot(balt, aes(year, Emissions))
q +
geom_boxplot(aes(color=factor(year))) +
facet_grid(.~type) +
geom_smooth(method='lm')
produces:
Without outliers:
q +
geom_boxplot(aes(color=factor(year)), outlier.shape=NA) +
facet_grid(.~type) +
geom_smooth(method='lm')
produces:
Clearly I want to resize the Y axis, so I set the upper Y limit to what looks like 80, since the top tail of the 1999 boxplot for "NONPOINT" looks like it comes pretty close to there from the previous two plots:
q +
geom_boxplot(aes(color=factor(year)), outlier.shape=NA) +
scale_y_continuous(limits=c(0,80)) +
facet_grid(.~type) +
geom_smooth(method='lm')
produces:
To demonstrate that it's resizing, I reset the upper Y limit to 60, which the "NONPOINT" 1999 boxplot clearly crosses:
The warnings I'm getting for the final plot read as:
Warning messages:
1: Removed 4 rows containing non-finite values (stat_boxplot).
2: Removed 24 rows containing non-finite values (stat_boxplot).
3: Removed 9 rows containing non-finite values (stat_boxplot).
4: Removed 4 rows containing missing values (stat_smooth).
5: Removed 24 rows containing missing values (stat_smooth).
6: Removed 9 rows containing missing values (stat_smooth).
7: Removed 9 rows containing missing values (geom_point).
8: Removed 17 rows containing missing values (geom_point).
9: Removed 17 rows containing missing values (geom_point).
10: Removed 17 rows containing missing values (geom_point).
11: Removed 1 rows containing missing values (geom_point).
12: Removed 1 rows containing missing values (geom_point).
13: Removed 1 rows containing missing values (geom_point).
14: Removed 2 rows containing missing values (geom_point).
15: Removed 20 rows containing missing values (geom_point).
16: Removed 52 rows containing missing values (geom_point).
17: Removed 59 rows containing missing values (geom_point).
18: Removed 41 rows containing missing values (geom_point).
19: Removed 1 rows containing missing values (geom_point).
20: Removed 7 rows containing missing values (geom_point).
21: Removed 10 rows containing missing values (geom_point).
22: Removed 43 rows containing missing values (geom_point).
I'm not quite sure what to make of the non-finite values, but the rest of the warnings look like they're just removing outliers? I could be wrong here, but I wouldn't know how to guess otherwise.
Finally, setting the Y upper limit to 20 produces conflicting linear model results:
Where "NONPOINT" had been modeled as a negative slope before, it now appears as a positive slope. Clearly the resizing of the boxplots is influencing the models. Are outlier.shape=NA and scale_y_continuous() actively removing data?
Is my approach horribly flawed? I haven't seen a better method on Stack Overflow or elsewhere for removing outliers from boxplots.
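For what it's worth, here is a minimal sketch of how the question could be checked on made-up data. The toy data frame and its values are hypothetical stand-ins for balt, not my real data. As far as I understand it, limits set through scale_y_continuous() turn out-of-range values into NA before the stats are computed, which would match the "non-finite values" warnings, whereas outlier.shape=NA only hides the points and coord_cartesian() only zooms the view. The sketch compares a fit on all rows with a fit on only the rows that would survive an upper limit of 20:

```r
library(ggplot2)

# Made-up data standing in for one emission type in balt (hypothetical values)
set.seed(1)
toy <- data.frame(
  year = rep(c(1999, 2002, 2005, 2008), each = 50),
  Emissions = c(rexp(50, 1 / 30), rexp(50, 1 / 20),
                rexp(50, 1 / 15), rexp(50, 1 / 10))
)

# Linear model on every row, which is what I actually want to see
fit_all <- lm(Emissions ~ year, data = toy)

# Linear model on only the rows inside limits = c(0, 20); this mimics
# what geom_smooth() is left with once the scale has censored the rest
kept <- subset(toy, Emissions >= 0 & Emissions <= 20)
fit_cut <- lm(Emissions ~ year, data = kept)

# Compare the two slopes; they are fitted to different sets of rows
slope_all <- coef(fit_all)[["year"]]
slope_cut <- coef(fit_cut)[["year"]]

# Zoom the view instead of limiting the scale, so the boxplot and
# smooth stats still receive all of the data
p <- ggplot(toy, aes(year, Emissions)) +
  geom_boxplot(aes(group = year), outlier.shape = NA) +
  geom_smooth(method = "lm") +
  coord_cartesian(ylim = c(0, 20))
```

If the two slopes differ noticeably, that would suggest the flipped "NONPOINT" trend comes from rows being dropped by the scale limits rather than from hiding the outlier points.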