If you were writing a bioinformatics algorithm in Haskell, you'd probably use an algebraic data type to represent the nucleotides:
data Nucleotide = A | T | C | G
You'd do similarly in Standard ML or OCaml, I assume (I've never really used either).
A value of type Nucleotide
can clearly be contained in two bits. However, doing so would cause access times to be slower than if you used one byte per Nucleotide
value, as you'd need to select out the two bits of interest using binary operators.
There is therefore an inherent tradeoff that the compiler must make between memory efficiency and computational efficiency when deciding how to represent algebraic data types. Furthermore, the representation of algebraic data types in memory is made more complicated by the fact that the value can be of variable size:
data Maybe a = Just a | Nothing
Clearly, a Maybe a
value of the form Just a
is logically larger than a value of the form Nothing
. In an extreme example like this:
data Hulk a b c d e = Big a b c d e | Little
you definitely wouldn't want to have to store in a Little
value null pointers or zero values for the five values contained in Big
values. I'm assuming that you'd just use heap-allocated memory of variable size, with a constructor ID at the beginning (for example, 0
for Big
and 1
for Little
). However, if you wanted to store Hulk
values on the stack (a faster representation), you'd have to store the blank memory along with Little
values so that all values of the type Hulk
are the same size. Another tradeoff.
Simon Marlow answered my general question in regard to GHC in a previous StackOverflow question. However, I have three related questions that remain unanswered:
- Do Standard ML (SML/NJ and MLton) and OCaml use the same technique?
- If so, do any less common compilers of these languages (or their siblings) experiment with other techniques?
- Is there a reasonably easy way (ideally a pragma or option flag) in these languages to use a more memory efficient representation, like the two-bit representation of
Nucleotide
? Such memory efficiency is necessary for many bioinformatics applications; if eachNucleotide
had to be one byte, high-performance bioinformatics algorithms would have to resort to manual bit-fiddling.
-ddump-asm
or-ddump-simpl
to view how it's stored at a lower level. Basically, for your simple example, each tag appears to be represented as along
, but there's some meta data in there that I'm not really sure what it's doing either. The basic gist is that each constructor gets turned into a closure, then those get combined to form the data type's closure. – bheklilr