1
votes

I'm currently plotting some data (response times in ms) with geom_boxplot.

I have a question:

When you set limits on the y-axis, are values outside those limits disregarded in the plotting and error bar calculations?

The data comprises over 20k entries, and I'm not sure providing a sample would be of much use, as this is more of a functionality-based question.

Here is the code I use:

f <- function(x) {ans <- boxplot.stats(x)
data.frame(ymin = ans$conf[1], ymax = ans$conf[2], y = ans$stats[3])}





RTs.box = ggplot(mean.vis.aud.long, aes(x = Report, y = RTs, fill =Report)) + theme_bw() + facet_grid(Audio~Visual) 
RTs.box + 
geom_boxplot(alpha = .8) + geom_hline(yintercept = .333, linetype = 3, alpha = .8) + theme(legend.position = "none") + ylab("Response Times (ms)") + scale_fill_grey(start=.4) +
 labs(title = expression("Visual Condition")) + theme(plot.title = element_text(size = rel(1)))+
 theme(panel.background = element_rect())+

 #line below for shaded confidence intervals    
 stat_summary(fun.data = f, geom = "crossbar", 
            colour = NA, fill = "skyblue", width = 0.75, alpha = .9)+
 ylim(0,1000) # this is the value I change; different values give different plots and shaded confidence intervals

Here is the plot with

ylim(0,1000)

[image: faceted boxplots with ylim(0,1000)]

And using the same data but changing the limit to

ylim(0,3000)

results in this plot:

[image: faceted boxplots with ylim(0,3000)]

As you can see, the values in the boxplots appear to change with the limit used. Instead of the boxes being drawn up to the edge of the limit, the percentiles themselves are reduced. This is apparent when you compare the middle boxplot in the top-left panel of both grids.

The confidence intervals also differ between the two plots.

Does this mean geom_boxplot is discarding the data above the limit or is there something I'm missing?

I want to include all the data when computing the boxplots and confidence intervals, but limit the scale so they can be seen clearly. That means not seeing some major outliers in the data, but for my purposes that is fine.

Does anyone have any suggestions as to what is going on here and how to get around it without dropping the values outside the chosen visual range from my calculations?

Thanks as always.

2
You can use ylim(0, max(mean.vis.aud.long$RTs)). - Andrelrms

2 Answers

3
votes

From ?ylim: "Observations not in this range will be dropped completely and not passed to any other layers. If a NA value is substituted for one of the limits that limit is automatically calculated."

If you want to adjust the limits without affecting the data, use coord_cartesian instead.
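To see why the boxes and the shaded intervals shift, here is a minimal base R sketch of what the summary functions effectively receive once values above the limit have been dropped. The data are simulated and purely hypothetical (they are not the asker's), and the manual filter only imitates what the scale limits do.

```r
# Simulated response times in ms (hypothetical data, not from the question)
set.seed(1)
rts <- c(runif(200, min = 300, max = 900),    # bulk of the responses
         runif(40, min = 1000, max = 3000))   # slow responses above 1000 ms

# Imitate ylim(0, 1000): values outside the limits are removed
# before any statistics are computed
rts_limited <- rts[rts >= 0 & rts <= 1000]

full    <- boxplot.stats(rts)
limited <- boxplot.stats(rts_limited)

full$stats[3]     # median of all observations
limited$stats[3]  # median after dropping values above 1000 (lower)
full$conf         # notch interval from all observations
limited$conf      # notch interval from the truncated data
```

Both the median and the interval move once the slow responses are removed, which matches the difference between the two plots in the question.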

2
votes

The function ylim removes observations outside the limits before the boxplot statistics are computed, so it changes which data points go into the plot. To avoid this, use coord_cartesian, which only zooms the view and does not change the underlying data.

Try to replace ylim(0,1000) with:

coord_cartesian(ylim = c(0,1000))
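As a rough end-to-end check, the sketch below rebuilds a simplified version of the plot from the question with coord_cartesian and compares the medians drawn by the boxplot layer with the medians of the full data. The data frame d is a hypothetical stand-in for mean.vis.aud.long, and the faceting and theme settings are left out for brevity.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for mean.vis.aud.long (simulated values)
d <- data.frame(
  Report = rep(c("A", "B"), each = 120),
  RTs    = c(runif(100, 300, 900), runif(20, 1000, 3000),
             runif(100, 300, 900), runif(20, 1000, 3000))
)

# Same summary function as in the question
f <- function(x) {
  ans <- boxplot.stats(x)
  data.frame(ymin = ans$conf[1], ymax = ans$conf[2], y = ans$stats[3])
}

p <- ggplot(d, aes(x = Report, y = RTs, fill = Report)) +
  geom_boxplot(alpha = .8) +
  stat_summary(fun.data = f, geom = "crossbar",
               colour = NA, fill = "skyblue", width = 0.75, alpha = .9) +
  coord_cartesian(ylim = c(0, 1000))  # zooms the view, keeps all data

# Medians computed by the boxplot layer vs. medians of the full data
built <- layer_data(p, 1)
built$middle
tapply(d$RTs, d$Report, median)
```

With coord_cartesian the two sets of medians should agree, because every observation still reaches the statistics. Swapping the last layer for ylim(0, 1000) should instead produce a warning about removed rows and lower medians.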