bioinformatics compressing nucleotide sequences

Question

What would be the recommended compression format (.xz, tar.gz, tar.bz2, and so on) for compressing a dataset consisting of FASTA nucleotide sequences?

Which of the following compression mechanisms would be recommended for such data?

Dictionary based compression
Adaptive dictionary based compression
LZW algorithm based compression

Use gzip because everyone uses gzip. Even if you can squeeze a bit more compression out of another method, more bioinformatics tools will read gzipped files. — CJR
Certainly not LZW. That's obsolete technology. A great deal of attention has been paid to the compression of sequencing data. For fasta, see ncbi.nlm.nih.gov/pmc/articles/PMC3866555 — Mark Adler

Timur Shtatland · Accepted Answer · 2021-11-01T16:57:51

gzip is what I have seen used most often, so I recommend it, as CJR mentioned in the comments. It is the option most compatible with your collaborators' tools, even though it is not the most efficient (depending on how you define efficiency).
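To illustrate the compatibility point: gzipped FASTA can be written and read as a stream with nothing but the Python standard library, so the file never needs to be decompressed on disk. This is only a minimal sketch; the helper names `write_fasta_gz` and `read_fasta_gz` are hypothetical, and a real pipeline would more likely rely on an established FASTA parser.

```python
import gzip


def write_fasta_gz(path, records):
    """Write (header, sequence) pairs to a gzip-compressed FASTA file."""
    with gzip.open(path, "wt") as handle:
        for header, seq in records:
            handle.write(f">{header}\n{seq}\n")


def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file.

    Sequences that span several lines are joined into one string.
    """
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    # Emit the final record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

Because the reader is a generator, only one record is held in memory at a time.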

If you and your collaborators can install specialized compression tools, it may be worth looking into more efficient ones. For example, the following paper compares many of them using several different metrics (see especially Figure 1):

Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi, Sequence Compression Benchmark (SCB) database—A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences, GigaScience, Volume 9, Issue 7, July 2020, giaa072, https://doi.org/10.1093/gigascience/giaa072 : https://academic.oup.com/gigascience/article/9/7/giaa072/5867695
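Before installing anything specialized, it can help to check how the general-purpose formats from the question behave on your own data. The sketch below compares gzip, bzip2, and xz using Python's standard library only. The functions `compressed_sizes` and `synthetic_fasta` are hypothetical helpers, and the synthetic input is uniformly random ACGT, which is an assumption made purely so the example is self-contained. Real genomes contain repeats and other structure, so the numbers will differ; substitute the bytes of an actual FASTA file to get meaningful results.

```python
import bz2
import gzip
import lzma
import random


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of data under each stdlib compressor."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
        "xz": len(lzma.compress(data)),  # default container is .xz
    }


def synthetic_fasta(n_records=50, length=2000, seed=0) -> bytes:
    """Build a toy FASTA file of random bases (not representative data)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_records):
        lines.append(f">seq{i}")
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        # Wrap sequence lines at 60 characters, a common FASTA convention.
        lines.extend(seq[j:j + 60] for j in range(0, length, 60))
    return ("\n".join(lines) + "\n").encode("ascii")


if __name__ == "__main__":
    sizes = compressed_sizes(synthetic_fasta())
    for name, size in sizes.items():
        print(f"{name:>4}: {size} bytes ({size / sizes['raw']:.1%} of raw)")
```

Size is only one metric. Compression and decompression time and memory use also matter, and they are among the things the benchmark above measures.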
