2
votes

I want to display "n = (n)" over the whiskers of each of my boxplots. I have figured out how to put these labels over the top of each box (q75) using fivenum, but I can't get them working above the whisker. Above the whiskers is better because my plots are very cluttered.

Here I've reproduced the plots using mtcars. Edit: mtcars has no significant outliers, but my dataset does. That's why the label needs to be on top of the whisker, and not just on the highest data point.

Sidenote: I am working with a lot of outliers and want to leave them out of the display. ggplot2 can hide them, but it still includes them when setting the axis range, which gives me a very "zoomed out" plot. My workaround for this is included: I've used the base boxplot function to calculate the highest whisker, and used coord_cartesian to set the upper limit just above that.

> data("mtcars")
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> 
> library(data.table)
> library(ggplot2)
> d = data.table(mtcars)
> 
> give.n <- function(x){
+   return(data.frame(y = fivenum(x)[4],
+                     label = paste("n =",length(x))))
+ }
> 
> p1 <- boxplot(mpg~cyl, data=mtcars, outline=FALSE,
+               plot=FALSE)
> p1stats <- p1$stats[5,]
> head(p1stats)
[1] 33.9 21.4 19.2
> upperlim <- max(p1$stats, na.rm = TRUE) * 1.05
>   
> p <- ggplot(d, aes(x=factor(cyl), y=mpg)) +
+     geom_boxplot() +
+ stat_summary(fun.data = give.n, geom = "text", vjust=-.5)
> 
> p <- p + coord_cartesian(ylim = c(0, upperlim))

I tried changing this function (which works):

> give.n <- function(x){
+   return(data.frame(y = fivenum(x)[4],
+                     label = paste("n =",length(x))))
+ }

To this, using the 5th row of p1 stats (the upper whiskers):

give.n <- function(x){
  return(data.frame(y = p1stats,
                    label = paste("n =",length(x))))
}

But that returns this:

[Image: bad plot]

How do I get this to display the label on only the correct whisker point for each box?

PS - My apologies, I'm unfamiliar with posting here but I tried

3 Answers

1
votes

Here is a ggplot2 solution with dplyr:

ggplot(mtcars, aes(x=cyl, y=mpg, group=cyl)) + 
  geom_boxplot() + 
  geom_text(data=mtcars %>% group_by(cyl) %>% summarise(top = max(mpg), n=n()), aes(x=cyl, y=top, label= paste0("n = ", n)), nudge_y=1)

EDIT

There's probably a more concise way, but I think this works. I edited a data point for cyl=8 for emphasis:

 ggplot(mtcars, aes(x=cyl, y=mpg, group=cyl)) + 
  geom_boxplot() + 
  geom_text(data=mtcars %>% 
              group_by(cyl) %>% 
              summarise(q3 = quantile(mpg, 0.75),
                        q1 = quantile(mpg, 0.25),
                        iqr = q3 - q1,
                        top = max(mpg[mpg <= q3 + 1.5*iqr]),  # whisker ends at the last point inside the fence
                        n=n()), 
            aes(x=cyl, y=top, label= paste0("n = ", n)), nudge_y=1)

1
votes

Okay, scratch that last attempt; I figured it out. boxplot.stats and geom_boxplot calculate the quartiles differently (boxplot.stats uses Tukey's hinges from fivenum, while geom_boxplot uses quantile), and that shifts the whisker positions in small samples. We can retrieve the actual stats geom_boxplot uses with ggplot_build.
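To make the difference concrete, here is a minimal sketch in base R. The four-value sample is made up purely for illustration, and it assumes geom_boxplot is left at its default quantile method (type 7):

```r
# Toy sample, chosen only for illustration.
x <- c(1, 2, 3, 4)

# Upper hinge, as used by boxplot.stats (via fivenum).
hinge <- fivenum(x)[4]

# 75th percentile from quantile() with its default type 7,
# which is what geom_boxplot is assumed to use here.
quartile <- quantile(x, 0.75, names = FALSE)

# The two "upper quartile" values disagree on this sample.
c(hinge = hinge, quartile = quartile)
```

Because the whisker fence is built from these values, a small shift in the quartile can change which data point the whisker stops at.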

This is how it's done, son. First, make your plot as above; I called it p. Now calculate the sample size for each x value (count is from dplyr):

samp <- count(mtcars, cyl)

Now retrieve the data from the plot using ggplot_build:

ggstat <- ggplot_build(p)$data
ggwhisk1 <- ggstat[[1]]$ymax

Now combine that with the sample size, and use that data in geom_text:

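ggwhisk2 <- data.frame(samp, whisk = ggwhisk1)
# x must be a factor to match the plot's discrete x axis; vjust is a constant, so it goes outside aes()
p <- p + geom_text(data = ggwhisk2, size = 2, vjust = -.5,
                   aes(x = factor(cyl), y = whisk, label = paste0("n = ", n)))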

Voila!!
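
For anyone who would rather stay with the stat_summary approach from the question, here is a rough sketch of an alternative that computes the whisker end per group instead of reading it back from the built plot. The helper name upper_whisker is made up for this example (it is not part of ggplot2), and it assumes geom_boxplot's defaults of coef = 1.5 and quantile() type 7:

```r
# Hypothetical helper (not part of ggplot2): the largest observation
# within coef * IQR above the 75th percentile, assuming geom_boxplot's defaults.
upper_whisker <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- q[2] + coef * (q[2] - q[1])
  max(x[x <= fence])
}

# Same shape as the question's give.n, but y is the whisker end
# of the group being summarised rather than the upper hinge.
give.n <- function(x) {
  data.frame(y = upper_whisker(x), label = paste("n =", length(x)))
}

# Per-group check on mtcars, no plotting needed.
whisk <- sapply(split(mtcars$mpg, mtcars$cyl), upper_whisker)
whisk
```

Passing this give.n to stat_summary(fun.data = give.n, geom = "text", vjust = -.5), as in the question, should then put each label on its own box's whisker. Note that for cyl = 8 this gives 18.7 rather than the 19.2 reported by the base boxplot call, because under the quantile-based fence the 19.2 point counts as an outlier.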

0
votes

Edit: see the comment below and my other answer!

Okay I figured it out using the format of Alan's answer. It needed boxplot.stats to get the correct whisker calculation:

geom_text(data=mtcars %>% group_by(cyl) %>%
            summarise(n = n(),
                      # $stats is c(lower whisker, lower hinge, median, upper hinge, upper whisker)
                      whisker = boxplot.stats(mpg)$stats[5]),
            aes(x=cyl, y=whisker, label=paste0("n = ", n)))