Biologist and ggplot2 beginner here. I have a relatively large dataset of DNA sequence data (millions of short DNA fragments) which I first need to filter for quality for each sequence. I would like to illustrate how many of my reads are getting filtered out with a stacked bar plot using ggplot2.
I have figured out that ggplot likes the data in long format and have succesfully reformatted it with the melt function from reshape2
This is what a subset of the data looks like at the moment:
library sample filter value
LIB0 0011a F1 1272707
LIB0 0018a F1 1505554
LIB0 0048a F1 1394718
LIB0 0095a F1 2239035
LIB0 0011a F2 250000
LIB0 0018a F2 10000
LIB0 0048a F2 10000
LIB0 0095a F2 10000
LIB0 0011a P 2118559
LIB0 0018a P 2490068
LIB0 0048a P 2371131
LIB0 0095a P 3446715
LIB1 0007b F1 19377
LIB1 0010b F1 79115
LIB1 0011b F1 2680
LIB1 0007b F2 10000
LIB1 0010b F2 10000
LIB1 0011b F2 10000
LIB1 0007b P 290891
LIB1 0010b P 1255638
LIB1 0011b P 4538
library and sample are my ID variables (the same sample can be in multiple libraries). 'F1' and 'F2' mean that this many reads were filtered out during this step, 'P' means the remaining number of sequence reads after filtering.
I have figured out how to make a basic stacked barplot but now I am running into trouble because I cannot figure out how to properly reorder the factors on the x-axis so the bars are sorted in descending order in the plot based on the sum of F1, F2 and P. The way it is now I think they are sorted alphabetically within library based on sample name
testdata <- read.csv('testdata.csv', header = T, sep = '\t')
ggplot(testdata, aes(x=sample, y=value, fill=filter)) +
geom_bar(stat='identity') +
facet_wrap(~library, scales = 'free')
After some googling I found out about the aggregate function that gives me the total for each sample per library:
aggregate(value ~ library+sample, testdata, sum)
library sample value
1 LIB1 0007b 320268
2 LIB1 0010b 1344753
3 LIB0 0011a 3641266
4 LIB1 0011b 17218
5 LIB0 0018a 4005622
6 LIB0 0048a 3775849
7 LIB0 0095a 5695750
While this does give me the totals, I now have no idea how I can use this to reorder the factors, especially since there are two I need to consider (library and sample).
So I guess my question boils down to: How can I order my samples in my graph based on the total of F1, F2 and P for each library?
Thank you very much for any pointers you can give me!