12
votes

So I have a three column data frame that has Trials, Ind. Variable, Observation. Something like:

df1<- data.frame(Trial=rep(1:10,5), Variable=rep(1:5, each=10), Observation=rnorm(1:50))

I am trying to plot a 95% conf. Interval around the mean for each trial using a rather inefficient method as follows:

    b<-NULL
    b$mean<- aggregate(Observation~Variable, data=df1,mean)[,2]
    b$sd  <- aggregate(Observation~Variable, data=df1,sd)[,2]
    b$Variable<- df1$Variable
    b$Observation <- df1$Observation 
    b$ucl <- rep(qnorm(.975, mean=b$mean, sd=b$sd), each=10)
    b$lcl <- rep(qnorm(.025, mean=b$mean, sd=b$sd), each=10)
    b<- as.data.frame(b)
    c <- ggplot(b, aes(Variable, Observation))  
    c + geom_point(color="red") + 
    geom_smooth(aes(ymin = lcl, ymax = ucl), data=b, stat="summary", fun.y="mean")

This is inefficient since it duplicates values for ymin, ymax. I've seen the geom_ribbon methods but I would still need to duplicate. However, if I was using any kind of smoothing like glm, the code is much simpler with no duplication. Is there a better way of doing this?

References: 1. R Plotting confidence bands with ggplot 2. Shading confidence intervals manually with ggplot2 3. http://docs.ggplot2.org/current/geom_smooth.html

2
It looks like your variable is discrete. How about using boxplots? Not exactly the same, but maybe good enough for your purposes? Or is your real variable continuous?BrodieG
@BrodieG: Variable unfortunately is continuous(100+ data-points), otherwise boxplot would have been perfect.user2217564
So do you wan the CI for a rolling bucket? Or do you want the CI for each of the X values (assuming multiple instances of each X value)?BrodieG
yes.. but NOT individual instances rather aggregate ones. I'm looking for CLs for each value of X regardless of the number of instances. If you try the code in the question it does that. I just thought there must be a stat or something more efficient.user2217564
I am even considering writing a small CI function and passing it to summary but haven't found a way that would make it work.user2217564

2 Answers

10
votes

With this method, I get the same output as with your method. This was inspired by the docs for ggplot. Again, this will be meaningful so long as each x value has multiple points.

set.seed(1)
df1 <- data.frame(Trial=rep(1:10,5), Variable=rep(1:5, each=10), Observation=rnorm(1:50))    my_ci <- function(x) data.frame(y=mean(x), ymin=mean(x)-2*sd(x), ymax=mean(x)+2*sd(x))

my_ci <- function(x) data.frame(
  y=mean(x), 
  ymin=mean(x) - 2 * sd(x), 
  ymax=mean(x) + 2 * sd(x)
)
ggplot(df1, aes(Variable, Observation)) + geom_point(color="red") +
  stat_summary(fun.data="my_ci", geom="smooth")

enter image description here

7
votes

The ggplot package comes with wrappers for a number of summarizing functions in the Hmisc package, including

  1. mean_cl_normal which calculates the confidence limits based on the t-distribution,
  2. mean_cl_boot which uses a bootstrap method that does not assume a distribution of the mean,
  3. mean_sdl which uses a multiple of the standard deviation (default=2).

This latter method is the same as in the answer above, but is not the 95% CL. Confidence limits based on the t-distribution are given by:

CL = t × s / √n

Where t is the appropriate quantile of the t-distribution and s is the sample standard deviation. Compare the confidence bands:

ggplot(df1, aes(x=Variable, y=Observation)) + 
  stat_summary(fun.data="mean_sdl", geom="line", colour="blue")+
  stat_summary(fun.data="mean_sdl", mult=2, geom="errorbar", 
               width=0.1, linetype=2, colour="blue")+
  geom_point(color="red") +
  labs(title=expression(paste(bar(x)," \u00B1 ","2 * sd")))

ggplot(df1, aes(x=Variable, y=Observation)) + 
  geom_point(color="red") +
  stat_summary(fun.data="mean_cl_normal", geom="line", colour="blue")+
  stat_summary(fun.data="mean_cl_normal", conf.int=0.95, geom="errorbar", 
               width=0.1, linetype=2, colour="blue")+
  stat_summary(fun.data="mean_cl_normal", geom="point", size=3, 
               shape=1, colour="blue")+
  labs(title=expression(paste(bar(x)," \u00B1 ","t * sd / sqrt(n)")))

Finally, rotating this last plot using coord_flip() generates something very close to a Forest Plot, which is a standard method for summarizing data like yours.

ggplot(df1, aes(x=Variable, y=Observation)) + 
  geom_point(color="red") +
  stat_summary(fun.data="mean_cl_normal", conf.int=0.95, geom="errorbar", 
               width=0.2, colour="blue")+
  stat_summary(fun.data="mean_cl_normal", geom="point", size=3, 
               shape=1, colour="blue")+
  geom_hline(aes(yintercept=mean(Observation)), linetype=2)+
  labs(title="Forest Plot")+
  coord_flip()