Get R ggplot boxplot for counted data

Question

I am trying to generate a grouped boxplot with ggplot. It worked well, but then the data structure changed.

Before, we had this structure:

sample  length
sample1 50
sample1 50
sample1 50
sample1 51
sample1 51
sample2 50
sample2 50
sample2 51
sample2 51
sample2 51

A boxplot was generated for every sample.

We used the following code to generate the boxplot:

    library(ggplot2)
    library(ggforce)  # provides facet_grid_paginate()

    for_sequence_length_distribution_plot <-
      ggplot(for_sequence_length_distribution,
             aes(fill = read,
                 group = interaction(sample, state, read),
                 x = sample,
                 y = sequenceLength)) +
      geom_boxplot(inherit.aes = TRUE,
                   size = 0.25,
                   #outlier.size = 0.1,
                   #outlier.alpha = 0.25,
                   outlier.shape = NA,
                   #varwidth = T,
                   alpha = .9,
                   trim = FALSE,  # a geom_violin() argument; geom_boxplot() ignores it with a warning
                   width = 5) +
      coord_flip() +
      xlab('sample id') +
      ylab('sequence length [bp]') +
      facet_grid_paginate(facets = ~sub_id~state, ncol = 2, nrow = row_count,
                          scales = 'free', shrink = FALSE, byrow = TRUE) +
      theme(plot.title = element_text(hjust = 0, lineheight = 0.9),
            plot.subtitle = element_text(hjust = 0, lineheight = 0.9),
            axis.title.x = element_text(margin = margin(t = 1, r = 1, b = 1, l = 1, unit = 'line')),
            #strip.text.x = element_blank(),
            strip.text.y = element_blank(),
            plot.tag.position = "bottomleft",
            plot.tag = element_text(hjust = 0, vjust = 0, lineheight = 0.9),
            plot.caption = element_text(hjust = 0, lineheight = 0, face = "italic"),
            plot.caption.position = "panel",
            legend.position = "bottom",
            legend.box = 'horizontal')

Now, we have this structure to save disk space:

sample  length  count
sample1 50      3
sample1 51      2
sample2 50      2
sample2 51      3

Now I wonder how to change the code above to get the same result. The change should be as minimal as possible.

I already tried changing the first part to

    for_sequence_length_distribution_plot <-
      ggplot(for_sequence_length_distribution,
             aes(fill = read,
                 group = interaction(sample, state, read),
                 colour = count,
                 x = sample,
                 y = sequenceLength,
                 weight = count))

but this resulted in an empty plot.

For simplicity, the data structure above only shows "sample", but there are more indicators, such as read (R1 or R2) and state (raw or trimmed). So for every sample there are four boxplots: sample-R1-raw, sample-R1-trimmed, sample-R2-raw, and sample-R2-trimmed. I don't think this is relevant here, though.

Comment (Jon Spring): library(tidyr); ggplot(for_sequence_length_distribution %>% uncount(count), .... Here uncount() copies each row as many times as specified, in this case by count, effectively reproducing your original data format.
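
As a minimal sketch of that suggestion applied to the structure in the question, the example below builds a small counted table, expands it with tidyr::uncount(), and passes the result to a simplified version of the plotting call. The toy values and the column name sequenceLength mirror the question; the read and state values are filled in here as assumptions for illustration, and the faceting and theme code is left out.

```r
library(ggplot2)
library(tidyr)

# Toy counted data in the new format (illustrative values, not real data).
# The read and state columns are assumed here so the aes() call resolves.
counted <- data.frame(
  sample         = c("sample1", "sample1", "sample2", "sample2"),
  read           = "R1",
  state          = "raw",
  sequenceLength = c(50, 51, 50, 51),
  count          = c(3, 2, 2, 3)
)

# uncount() repeats each row `count` times and drops the count column,
# restoring the one-row-per-observation layout the original code expects.
expanded <- uncount(counted, count)

# Simplified version of the original plotting call, fed the expanded data.
p <- ggplot(expanded, aes(fill = read,
                          group = interaction(sample, state, read),
                          x = sample,
                          y = sequenceLength)) +
  geom_boxplot(outlier.shape = NA) +
  coord_flip()
```

Only the data argument changes; the aesthetics stay as they were, which keeps the edit to the original code small. The expansion happens in memory, so the file on disk can stay in the compact counted format.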

Jon Spring · Accepted Answer · 2021-10-12T17:31:41

Equivalent of the original plot:

library(ggplot2)
mtcars <- mtcars # saving to environment to see it's 32 rows
ggplot(mtcars, aes(cyl, mpg, group = gear)) +
  geom_boxplot()

Now we have a new data format in which the unique combinations are collapsed and counted:

mtcars_summary <- dplyr::count(mtcars, cyl, gear, mpg) # grouped, now 28 rows

To use this with the original code, we can use tidyr::uncount() to make as many copies of each row as we specify:

ggplot(tidyr::uncount(mtcars_summary, n), aes(cyl, mpg, group = gear)) +
  geom_boxplot()
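
One way to check that the expanded data is equivalent to the original is to compare the two directly. The sketch below assumes dplyr and tidyr are installed; the object names mtcars_expanded, orig_stats and new_stats are introduced here for illustration. It is a sanity check only, not part of the plotting code.

```r
library(dplyr)
library(tidyr)

# Collapse to counts, then expand again.
mtcars_summary  <- count(mtcars, cyl, gear, mpg)
mtcars_expanded <- uncount(mtcars_summary, n)

# The expanded data has as many rows as the original.
nrow(mtcars_expanded) == nrow(mtcars)

# Per-group quantiles of mpg (grouped by gear, as in the plot) match,
# because each group contains the same values, only in a different row order.
orig_stats <- tapply(mtcars$mpg, mtcars$gear, quantile)
new_stats  <- tapply(mtcars_expanded$mpg, mtcars_expanded$gear, quantile)
all.equal(orig_stats, new_stats)
```

Row order and row names are not preserved by the round trip, but boxplot statistics do not depend on either.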
