2
votes

The following command generates a simple histogram:

g<- ggplot(data = mtcars, aes(x = factor(carb) )) + geom_histogram()

Usually I add error bars to my plots like this:

g+stat_summary(fun.data="mean_cl_boot",geom="errorbar",conf.int=.95)

But that doesn't work with a histogram ("Error: geom_errorbar requires the following missing aesthetics: ymin, ymax"). I think this is because the y variable is not explicit: the counts are calculated automatically by geom_histogram, so one never declares a y variable.

Are we unable to use geom_histogram, and must we instead first calculate the y quantity (counts) ourselves and then specify it as the y variable in a call to geom_bar?

What kind of error bar do you want to add to a histogram? - Sven Hohenstein
The 95% confidence interval on the number of observations in the bin would be nice. The best formula for such a confidence interval would be more of an issue for stats.stackexchange.com, but here's a start: suchideas.com/articles/maths/applied/histogram-errors. Here I'm asking for the code to add an error bar of any kind. - Alex Holcombe

2 Answers

2
votes

It seems that one indeed cannot use geom_histogram; instead, we must calculate the counts (bar heights) and the confidence interval limits manually. First, to calculate the counts:

library(plyr)
mtcars_counts <- ddply(mtcars, .(carb), function(x) data.frame(count=nrow(x)))
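
If plyr is not available, the same per-value counts can be obtained with base R. This is only a minimal sketch and not part of the approach used in the rest of this answer; the names `carb_table` and `carb_counts` are chosen here purely for illustration.

```r
# Count the rows for each distinct value of carb using base R only.
carb_table <- table(mtcars$carb)

# Convert the table to a data frame with the same column names
# (carb, count) as mtcars_counts above.
carb_counts <- data.frame(
  carb  = as.numeric(names(carb_table)),
  count = as.vector(carb_table)
)
print(carb_counts)
```

The column names match those produced by the ddply call, so the result should be usable in the same way in the steps that follow.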

The remaining problem is calculating the confidence interval for a binomial proportion, here the count in a bin divided by the total number of cases in the data set. A variety of formulae have been proposed in the literature. Here we will use the Agresti & Coull (1998) method, available as add4ci in the PropCIs package.

library(PropCIs)
numTotTrials <- sum(mtcars_counts$count)

# Create a CI function for use with ddply, based on our total number of cases.
makeAdd4CIforThisHist <- function(totNumCases, conf.int) {
  add4CIforThisHist <- function(df) {
    # add4ci returns limits for the proportion; rescale them to counts
    CIstuff <- add4ci(df$count, totNumCases, conf.int)
    data.frame(ymin = totNumCases * CIstuff$conf.int[1],
               ymax = totNumCases * CIstuff$conf.int[2])
  }
  return(add4CIforThisHist)
}

calcCI <- makeAdd4CIforThisHist(numTotTrials,.95)

limits <- ddply(mtcars_counts, .(carb), calcCI)  # calculate the CI min, max for each bar

mtcars_counts <- merge(mtcars_counts, limits)  # combine the counts data frame with the CIs

g <- ggplot(data = mtcars_counts, aes(x = carb, y = count, ymin = ymin, ymax = ymax)) +
  geom_bar(stat = "identity", fill = "grey")
g + geom_errorbar()

resulting graph
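
For readers who want to see what the interval computes, here is a base R sketch of the add-4 adjustment as I understand it: two successes and two failures are added before forming a normal-approximation interval, and the limits are then scaled back to counts. The helper `add4_count_ci` is hypothetical and written for illustration; the PropCIs implementation may differ in details.

```r
# Hypothetical helper: add-4 interval for a binomial proportion,
# rescaled to the count scale. x = count in the bin, n = total cases.
add4_count_ci <- function(x, n, conf.level = 0.95) {
  p_tilde <- (x + 2) / (n + 4)             # add 2 successes and 2 failures
  z  <- qnorm(1 - (1 - conf.level) / 2)    # two-sided normal quantile
  se <- sqrt(p_tilde * (1 - p_tilde) / (n + 4))
  lower <- max(0, p_tilde - z * se)        # clamp to the valid range [0, 1]
  upper <- min(1, p_tilde + z * se)
  c(ymin = n * lower, ymax = n * upper)
}

# Example: 10 of the 32 cars in mtcars have carb == 4.
add4_count_ci(10, 32)
```

Because the proportion limits are clamped to [0, 1], the resulting error bars cannot extend below zero or above the total number of cases.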

1
votes

I am not sure that what you want to do is statistically valid.

For example, if we perform the summary (bin and compute) manually, we get NA for both Lower and Upper:

mtcars$carb_bin <- factor(cut(mtcars$cyl,8,labels=FALSE))
library(plyr)
library(Hmisc)  # provides smean.cl.boot
mtcars_sum <- ddply(mtcars, "carb_bin", 
                 function(x)smean.cl.boot(length(x$carb)))
mtcars_sum
  carb_bin Mean Lower Upper
1        1   11    NA    NA
2        4    7    NA    NA
3        8   14    NA    NA
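
The NA limits appear because each group has been reduced to a single number (its count) before the interval is computed, and an interval based on sample variability needs at least two observations. The following base R sketch shows the same effect with a hypothetical helper, `normal_interval`, which is a stand-in written for illustration and not the Hmisc function itself.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean.
normal_interval <- function(x, conf.level = 0.95) {
  n <- length(x)
  z <- qnorm(1 - (1 - conf.level) / 2)
  half_width <- z * sd(x) / sqrt(n)   # sd() is NA when x has one value
  c(Mean = mean(x), Lower = mean(x) - half_width, Upper = mean(x) + half_width)
}

normal_interval(c(11, 7, 14))  # three values: finite limits
normal_interval(11)            # a single value: Lower and Upper are NA
```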

And even if you compute just y and give it to ggplot2 to plot with geom_bar and an error bar, you will not get any error bars, since the upper and lower limits are not well defined.

mtcars_sum <- ddply(mtcars, "carb_bin", summarise,
                    y = length(carb))

ggplot(data = mtcars_sum, aes(x=carb_bin,y=y)) + 
  geom_bar(stat='identity',alpha=0.2)+
  stat_summary(fun.data="mean_cl_normal",col='red',
               fun.args=list(conf.int=.95),geom='pointrange')
