1
votes

Biologist and ggplot2 beginner here. I have a relatively large dataset of DNA sequence data (millions of short DNA fragments) which I first need to filter for quality for each sequence. I would like to illustrate how many of my reads are getting filtered out with a stacked bar plot using ggplot2.

I have figured out that ggplot likes the data in long format and have succesfully reformatted it with the melt function from reshape2

This is what a subset of the data looks like at the moment:

library sample  filter  value
LIB0    0011a   F1  1272707
LIB0    0018a   F1  1505554
LIB0    0048a   F1  1394718
LIB0    0095a   F1  2239035
LIB0    0011a   F2  250000
LIB0    0018a   F2  10000
LIB0    0048a   F2  10000
LIB0    0095a   F2  10000
LIB0    0011a   P   2118559
LIB0    0018a   P   2490068
LIB0    0048a   P   2371131
LIB0    0095a   P   3446715
LIB1    0007b   F1  19377
LIB1    0010b   F1  79115
LIB1    0011b   F1  2680
LIB1    0007b   F2  10000
LIB1    0010b   F2  10000
LIB1    0011b   F2  10000
LIB1    0007b   P   290891
LIB1    0010b   P   1255638
LIB1    0011b   P   4538

library and sample are my ID variables (the same sample can be in multiple libraries). 'F1' and 'F2' mean that this many reads were filtered out during this step, 'P' means the remaining number of sequence reads after filtering.

I have figured out how to make a basic stacked barplot but now I am running into trouble because I cannot figure out how to properly reorder the factors on the x-axis so the bars are sorted in descending order in the plot based on the sum of F1, F2 and P. The way it is now I think they are sorted alphabetically within library based on sample name

testdata <- read.csv('testdata.csv', header = T, sep = '\t')

ggplot(testdata, aes(x=sample, y=value, fill=filter)) + 
  geom_bar(stat='identity') +
  facet_wrap(~library, scales = 'free')

enter image description here

After some googling I found out about the aggregate function that gives me the total for each sample per library:

aggregate(value ~ library+sample, testdata, sum)

  library sample   value
1    LIB1  0007b  320268
2    LIB1  0010b 1344753
3    LIB0  0011a 3641266
4    LIB1  0011b   17218
5    LIB0  0018a 4005622
6    LIB0  0048a 3775849
7    LIB0  0095a 5695750

While this does give me the totals, I now have no idea how I can use this to reorder the factors, especially since there are two I need to consider (library and sample).

So I guess my question boils down to: How can I order my samples in my graph based on the total of F1, F2 and P for each library?

Thank you very much for any pointers you can give me!

1
Could this from SO be of help to you?KoenV

1 Answers

1
votes

You are almost there. You need to change factor levels of testdata$sample based on the aggregated data (I suppose no sample name appeared in both lib1 and lib0):

df <- aggregate(value ~ library+sample, testdata, sum)

testdata$sample <- factor(testdata$sample, levels = df$sample[order(-df$value)])

ggplot(testdata, aes(x=sample, y=value, fill=filter)) + 
    geom_bar(stat='identity') +
    facet_wrap(~library, scales = 'free')

enter image description here