I am trying to generate a grouped boxplot with ggplot. It worked well, but then our data structure changed.
Before, we had this structure:
sample length
sample1 50
sample1 50
sample1 50
sample1 51
sample1 51
sample2 50
sample2 50
sample2 51
sample2 51
sample2 51
and a boxplot was generated for every sample.
We used the following code to generate the boxplot:
for_sequence_length_distribution_plot <- ggplot(for_sequence_length_distribution,
                                                aes(fill = read,
                                                    group = interaction(sample, state, read),
                                                    x = sample,
                                                    y = sequenceLength)) +
  geom_boxplot(inherit.aes = TRUE,
               size = 0.25,
               #outlier.size = 0.1,
               #outlier.alpha = 0.25,
               outlier.shape = NA,
               #varwidth = T,
               alpha = .9,
               trim = FALSE,
               width = 5
  ) +
  coord_flip() +
  xlab('sample id') +
  ylab('sequence length [bp]') +
  facet_grid_paginate(facets = ~sub_id~state, ncol = 2, nrow = row_count, scales = 'free', shrink = FALSE, byrow = TRUE) +
  theme(plot.title = element_text(hjust = 0, lineheight = 0.9),
        plot.subtitle = element_text(hjust = 0, lineheight = 0.9),
        axis.title.x = element_text(margin = margin(t = 1, r = 1, b = 1, l = 1, unit = 'line')),
        #strip.text.x = element_blank(),
        strip.text.y = element_blank(),
        plot.tag.position = "bottomleft",
        plot.tag = element_text(hjust = 0, vjust = 0, lineheight = 0.9),
        plot.caption = element_text(hjust = 0, lineheight = 0, face = "italic"),
        plot.caption.position = "panel",
        legend.position = "bottom",
        legend.box = 'horizontal')
Now, we have this structure to save disk space:
sample length count
sample1 50 3
sample1 51 2
sample2 50 2
sample2 51 3
Now I wonder how to change the code above to get the same result. The change should be as minimal as possible.
I already tried changing the first part to:
for_sequence_length_distribution_plot <- ggplot(for_sequence_length_distribution, aes(fill = read, group = interaction(sample, state, read), colour = count, x = sample, y = sequenceLength, weight = count))
but this resulted in an empty plot.
(For simplicity, the data structures above show only "sample", but there are more indicators, such as read (R1 or R2) and state (raw or trimmed). So every sample has four boxplots: sample-R1-raw, sample-R1-trimmed, sample-R2-raw, and sample-R2-trimmed. I don't think this is relevant here, though.)
library(tidyr); ggplot(for_sequence_length_distribution %>% uncount(count), ....
uncount will copy a row as many times as specified, here by count, effectively reproducing your original data format. – Jon Spring
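To make the uncount() suggestion concrete, here is a minimal sketch on toy data with the columns shown in the question. The names compact and expanded are made up for illustration, and the plot is reduced to a plain boxplot per sample; the real data would keep the read, state and facet settings from the original code.

```r
library(tidyr)
library(ggplot2)

# Toy data in the compact format from the question (hypothetical values).
compact <- data.frame(
  sample = c("sample1", "sample1", "sample2", "sample2"),
  length = c(50, 51, 50, 51),
  count  = c(3, 2, 2, 3)
)

# uncount() repeats each row `count` times and drops the count column,
# giving back one row per observation as in the old format.
expanded <- uncount(compact, count)

# With one row per observation again, the boxplot code needs no weights.
p <- ggplot(expanded, aes(x = sample, y = length)) +
  geom_boxplot() +
  coord_flip()
```

The expansion happens only in memory at plotting time, so the file on disk can stay in the compact format.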