
I am quite new to R and am trying to create a percent stacked bar plot, which I have previously always created in Prism. In Prism, my graphs look like this:

[Figure: graph created with Prism]

I have tried several approaches but I am not sure if I understand the geom_bar() function correctly. It seems like the long data format works best for the plot:

structure(list(run = c("particle_count_run1", "particle_count_run1", 
"particle_count_run1", "particle_count_run1", "particle_count_run1", 
"particle_count_run2", "particle_count_run2", "particle_count_run2", 
"particle_count_run2", "particle_count_run2", "particle_count_run3", 
"particle_count_run3", "particle_count_run3", "particle_count_run3", 
"particle_count_run3", "particle_count_run1", "particle_count_run1", 
"particle_count_run1", "particle_count_run1", "particle_count_run1", 
"particle_count_run2", "particle_count_run2", "particle_count_run2", 
"particle_count_run2", "particle_count_run2", "particle_count_run3", 
"particle_count_run3", "particle_count_run3", "particle_count_run3", 
"particle_count_run3", "particle_count_run1", "particle_count_run1", 
"particle_count_run1", "particle_count_run1", "particle_count_run1", 
"particle_count_run2", "particle_count_run2", "particle_count_run2", 
"particle_count_run2", "particle_count_run2", "particle_count_run3", 
"particle_count_run3", "particle_count_run3", "particle_count_run3", 
"particle_count_run3"), sample = c("2K", "2K", "2K", "2K", "2K", 
"2K", "2K", "2K", "2K", "2K", "2K", "2K", "2K", "2K", "2K", "10K", 
"10K", "10K", "10K", "10K", "10K", "10K", "10K", "10K", "10K", 
"10K", "10K", "10K", "10K", "10K", "SEC", "SEC", "SEC", "SEC", 
"SEC", "SEC", "SEC", "SEC", "SEC", "SEC", "SEC", "SEC", "SEC", 
"SEC", "SEC"), size_range = structure(c(5L, 4L, 3L, 2L, 1L, 5L, 
4L, 3L, 2L, 1L, 5L, 4L, 3L, 2L, 1L, 5L, 4L, 3L, 2L, 1L, 5L, 4L, 
3L, 2L, 1L, 5L, 4L, 3L, 2L, 1L, 5L, 4L, 3L, 2L, 1L, 5L, 4L, 3L, 
2L, 1L, 5L, 4L, 3L, 2L, 1L), .Label = c("5_401:1999", "4_201:399", 
"3_151:199", "2_51:149", "1_1:49"), class = "factor"), value = c(0, 
0, 4462683, 296014836, 358497149, 0, 376611, 119940, 282521877, 
318477067, 0, 0, 799317, 242354584, 385487693, 0, 3353818, 176929269, 
964906541, 220288073, 0, 7054403, 124768386, 857429863, 207014319, 
0, 14605, 117673122, 790104146, 236717487, 7772, 894924035, 62830819, 
47826581, 3787399, 247825, 776011544, 56048930, 66062865, 3264425, 
3487, 437890092, 30162534, 33433418, 0)), row.names = c(NA, -45L
), class = c("tbl_df", "tbl", "data.frame"))

Using the data I first tried to create a percent stacked bar plot:

tmp %>%
  ggplot(aes(sample, value, fill = size_range)) +
  geom_bar(position = "fill", stat = "identity")

That actually led to a plot that looks pretty similar to the one I want to achieve:

[Figure: percent stacked bar plot created in R]

I am not sure, though, whether geom_bar() handles my data correctly. I have:

  • 3 different samples: 2K, 10K, SEC
  • For each sample I took 3 measurements: particle_count_run1, particle_count_run2, particle_count_run3
  • For each of these runs I have the amount of particles that were measured in a certain size range: the value

Since I did not know how to use all four variables with the geom_bar function I used sample on the x-axis, value on the y-axis and size_range as fill.

However, I am not sure whether geom_bar automatically takes the run variable into account and calculates the mean. If it does not, I do not know which value it uses.

Another problem I am having is that I am unable to compute the error bars while the bars are stacked. I have only been able to show them with position = "dodge" :

tmp %>%  ggplot(aes(sample, value, 
                    group = size_range, 
                    colour = size_range, 
                    fill = size_range)) +
  stat_summary(fun = mean,
               geom = "bar",
               position = "dodge") +
  stat_summary(fun.data = mean_cl_normal,
               geom = "errorbar",
               position = "dodge")

[Figure: graph with dodged bars and error bars]

Whenever I try to change the position, it no longer works.

Does anyone have an idea what I am doing wrong? I am really struggling to find a solution for the plot and would really appreciate any help possible :)

Welcome to SO! To help us help you, could you please make your issue reproducible by sharing a sample of your data instead of posting an image? See how to make a minimal reproducible example. Simply type dput(tmp) into the console and copy and paste the output, starting with structure(..., into your post. - stefan
Of course! I have changed the question and copied in the output. Thanks for the dput() tip! - Lena
I would check whether there is a ggplot extension that does it: exts.ggplot2.tidyverse.org/gallery - Tung
Thanks for the link! I went through them, but the only one I could find that came close is ggmosaic. I could not get it to work with the error bars either. - Lena

1 Answer


This is not a complete answer, but it is too long for a comment. I worked on it for a bit and wanted to share what I have, in case it helps someone else get you the rest of the way to a full solution:

  1. First of all, it seems like stacking error bars in ggplot is not well supported (https://stackoverflow.com/a/30873811/13210554) but can be forced manually.
  2. I believe you were trying to average replicate measurements (particle_count_run1, particle_count_run2, particle_count_run3) and you wish to represent the mean and variance of those. I think this is a sensible thing to do, that you have the data arranged in a suitable way to accomplish this and that your graph represents what you want it to.
  3. The hard part is getting ggplot2 to give you those stacked error bars. Here's where I can only get you part of the way...
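As a quick check on point 2, here is a minimal sketch of what position = "fill" with stat = "identity" does with the replicate runs. It does not compute a mean. It stacks every row, so each fill colour ends up covering the sum of its runs divided by the sample total. With the same number of runs in every group, that share equals the share computed from the per-group means. The toy data and the names from_rows and from_means are made up for illustration and are not from the question.

```r
library(dplyr)

# Hypothetical toy data: 1 sample, 3 runs, 2 size ranges (values are made up).
toy <- tibble(
  sample     = "A",
  run        = rep(paste0("run", 1:3), each = 2),
  size_range = rep(c("small", "large"), times = 3),
  value      = c(10, 30, 12, 28, 14, 32)
)

# What position = "fill" effectively displays: every row is stacked, so each
# size range covers the sum of its rows divided by the sample total.
from_rows <- toy %>%
  group_by(sample) %>%
  mutate(share = value / sum(value)) %>%
  group_by(sample, size_range) %>%
  summarise(share = sum(share), .groups = "drop")

# Shares computed from the per-group means instead.
from_means <- toy %>%
  group_by(sample, size_range) %>%
  summarise(mean_value = mean(value), .groups = "drop") %>%
  group_by(sample) %>%
  mutate(share = mean_value / sum(mean_value)) %>%
  ungroup()

# The two sets of shares should agree when every group has the same
# number of runs.
all.equal(from_rows$share, from_means$share)
```

If some runs were missing for some groups, the two calculations would no longer agree, so summarising explicitly before plotting is the safer route.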

The first point I'd make here is that in the example plot you showed, the error bars only go up, which keeps the plot visually clean. I agree with this approach for the stacked plot, but it means that you have to modify the default calculation of the bottom of the error bar.

This brings me to the core issue, which is calculating the values for the stacked error bars correctly. One option is to do this outside the ggplot call and pull a separate data.frame into stat_summary. The option I tried, so far without success, was a generalizable solution inside stat_summary using a custom function. In case it's helpful, here is what I have so far.

I took the guts of Hmisc::smean.cl.normal as my starting point:

## function (x, mult = qt((1 + conf.int)/2, n - 1), conf.int = 0.95, 
##     na.rm = TRUE) 
## {
##     if (na.rm) 
##         x <- x[!is.na(x)]
##     n <- length(x)
##     if (n < 2) 
##         return(c(Mean = mean(x), Lower = NA, Upper = NA))
##     xbar <- sum(x)/n
##     se <- sqrt(sum((x - xbar)^2)/n/(n - 1))
##     c(Mean = xbar, Lower = xbar - mult * se, Upper = xbar + mult * 
##         se)
## }

You can put this into the stat_summary call and produce the same plot by simply renaming the returned variables (e.g. Upper becomes ymax):

# df1 is assumed to hold the data from the question (called tmp there)
df1 %>%  ggplot(aes(sample, value, fill = size_range)) +
  geom_col(position = "stack") +
  stat_summary(fun.data =
                 function (x,
                           mult = qt((1 + conf.int) / 2, n - 1),
                           conf.int = 0.95,
                           na.rm = TRUE)
                 {
                   if (na.rm)
                     x <- x[!is.na(x)]
                   n <- length(x)
                   xbar <- sum(x) / n
                   se <- sqrt(sum((x - xbar) ^ 2) / n / (n - 1))
                   c(ymin = xbar,
                     ymax = xbar + mult * se)
                 },
               geom = "errorbar",
               width = 0.5,
               color = "black"
  )

[Figure: plot with calculated error bars]

Note that the bottom of the error bar is now the mean, so it ends up as a one-sided error bar. It will probably look best if you add a black outline to your bars at the end to cover the bottom tail (unless you find a way to remove it).

The remaining trouble is that each error bar sits where it should be if its bar started at the x axis. You need to shift it by cumulatively adding the values of the subgroups stacked below it. Then, to get a fill plot rather than a stack, you need to divide by the sum of each group so that every bar totals 1. It might not be possible to do this inside the stat_summary call, but maybe you can use that code to perform the calculation outside.
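For what it's worth, the outside-of-ggplot calculation might be sketched roughly like this. It is an untested-on-your-data sketch, not a finished solution. stacked_summary() is a hypothetical helper name, the toy data is made up, and the column names sample, run, size_range and value are assumed to match the question. One simplification to be aware of: the confidence half-width is scaled by the sample total as if that total were a fixed number, which ignores the uncertainty in the total itself.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data in the same long format as the question's data
# (2 samples x 3 runs x 2 size ranges; the values are made up).
toy <- tibble(
  sample     = rep(c("A", "B"), each = 6),
  run        = rep(rep(paste0("run", 1:3), each = 2), times = 2),
  size_range = factor(rep(c("small", "large"), times = 6),
                      levels = c("large", "small")),
  value      = c(10, 30, 12, 28, 14, 32, 40, 10, 44, 12, 36, 8)
)

# stacked_summary() is a hypothetical helper, not part of ggplot2.
stacked_summary <- function(df, conf.int = 0.95) {
  df %>%
    # Mean and standard error across the replicate runs.
    group_by(sample, size_range) %>%
    summarise(
      n_runs     = n(),
      mean_value = mean(value),
      se         = sd(value) / sqrt(n_runs),
      .groups    = "drop"
    ) %>%
    # Same t-based half-width as in Hmisc::smean.cl.normal.
    mutate(half_width = qt((1 + conf.int) / 2, n_runs - 1) * se) %>%
    # By default ggplot2 stacks the last factor level at the bottom,
    # so accumulate in that order.
    arrange(sample, desc(size_range)) %>%
    group_by(sample) %>%
    mutate(
      total = sum(mean_value),
      share = mean_value / total,           # bar height on the 0-1 scale
      top   = cumsum(mean_value) / total,   # top edge of each stacked bar
      ymax  = top + half_width / total      # upward-only error bar
    ) %>%
    ungroup()
}

s <- stacked_summary(toy)

# Bars are drawn from the summarised shares; error bars start at the top
# edge of each segment and only extend upwards.
p <- ggplot(s, aes(sample, share, fill = size_range)) +
  geom_col(colour = "black") +
  geom_errorbar(aes(ymin = top, ymax = ymax), width = 0.3)
```

Printing p draws the plot. With the question's data, stacked_summary(tmp) should work the same way as long as the column names match. Note that the error bar of the topmost segment can extend above 1.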