In a comment on the previous question, I claimed:
I have another benchmark indicating that ghc-7.4.1+llvm does unpack enumerations on strict data fields.
In fact, after some experimentation, I believe that, in at least some simple cases, using an enumeration is at least as fast as, and may actually be more memory-efficient (and hence faster in a more realistic application) than, using a newtype of Word8 (or Int), even when used in a strict field of a datatype. As I said in the previous question, I have observed a similar phenomenon in a more realistic (but still small) setting.
Can anyone point me to relevant references on what optimizations ghc/llvm performs for enumerations? In particular, does it really unpack the internal tag of the enumeration in a strict data field? The assembly output and the profiling results seem to suggest that it does, but I see no indication of this at the Core level. Any insights would be greatly appreciated.
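To be clear about what I mean by "unpacking the internal tag": I would expect the strict field to hold the constructor tag of the enumeration directly, rather than a pointer to a D closure, roughly as in the hand-written sketch below. This is purely hypothetical, for illustration; UnpackSketch, SUnpacked and toD are made-up names and are not part of the benchmark code.

{-# LANGUAGE MagicHash #-}
-- Hypothetical illustration only, NOT actual GHC output: what "unpacking the
-- tag of D into a strict field" would amount to if written by hand.
module UnpackSketch where

import GHC.Exts (Int#)

data D = A | B | C

-- The field stores the constructor tag directly, instead of a pointer to
-- one of the (shared, static) closures for A, B and C.
data SUnpacked = SUnpacked Int#   -- 0# for A, 1# for B, 2# for C

toD :: SUnpacked -> D
toD (SUnpacked 0#) = A
toD (SUnpacked 1#) = B
toD (SUnpacked _)  = C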
One more question: are enumerations always at least as efficient as newtypes of the corresponding Integral, in situations where using them makes sense? (Note that enumerations can behave like Integrals too.) If not, what is the (hopefully realistically useful) exception? Daniel Fischer suggested in his answer that putting an enumeration in a strict field of a multiple-constructor datatype might prevent certain optimizations. However, I have failed to verify this in the two-constructor case (see the sketch below). Maybe there is a difference for larger multiple-constructor datatypes?
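For concreteness, the two-constructor case I checked looks roughly like the following sketch (the names D2, Two, TwoL, TwoR are made up for illustration and are not part of the benchmark code):

data D2 = A2 | B2 | C2 deriving (Eq, Ord, Show, Enum, Bounded)
-- A two-constructor datatype with a strict enumeration field in each branch:
data Two = TwoL !D2
         | TwoR !D2
         deriving (Eq, Ord, Show)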
I'm also curious about what exactly is happening in the following benchmark. In all four cases, roughly the same number of bytes was allocated in the heap. However, for the enumerations the GC actually did less copying and the maximum residency was smaller than for the newtypes.
(Actually, my real question was: is it worthwhile to try to convert enumerations to newtypes when performance matters? But I assumed that being a bit more specific might be more helpful.)
A possible implication of this question is the following: if your program uses a large number of Ints that in practice only range over some very small set of values, then changing them to an enumeration (rather than to an unboxed type!) might be a performance gain (but be careful about strictness). An illustrative sketch follows.
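For instance (an illustrative sketch with made-up names, not taken from the benchmark code): an Int field that in practice only ever holds the values 0, 1 and 2 could be replaced by a three-constructor enumeration.

-- Before: an Int that in practice only takes the values 0, 1 and 2.
data Settings = Settings { verbosity :: !Int }

-- After: an enumeration with exactly those inhabitants.
data Verbosity = Quiet | Normal | Verbose deriving (Eq, Ord, Show, Enum, Bounded)
data Settings' = Settings' { verbosity' :: !Verbosity }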
Here is the summary of the benchmark, followed by the benchmark code and the convenience makefile for testing on your system.
benchmarking d
mean: 11.09113 ns, lb 11.06140 ns, ub 11.17545 ns, ci 0.950
std dev: 234.6722 ps, lb 72.31532 ps, ub 490.1156 ps, ci 0.950

benchmarking e
mean: 11.54242 ns, lb 11.51789 ns, ub 11.59720 ns, ci 0.950
std dev: 178.8556 ps, lb 73.05290 ps, ub 309.0252 ps, ci 0.950

benchmarking s
mean: 11.74964 ns, lb 11.52543 ns, ub 12.50447 ns, ci 0.950
std dev: 1.803095 ns, lb 207.2720 ps, ub 4.029809 ns, ci 0.950

benchmarking t
mean: 11.89797 ns, lb 11.86266 ns, ub 11.99105 ns, ci 0.950
std dev: 269.5795 ps, lb 81.65093 ps, ub 533.8658 ps, ci 0.950

OK, so the enumeration appears to be no less efficient than the newtype. Next, heap profiles of the function

heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x

data D = A | B | C:

  10,892,604 bytes allocated in the heap
   6,401,260 bytes copied during GC
   1,396,092 bytes maximum residency (3 sample(s))
      55,940 bytes maximum slop
           6 MB total memory in use (0 MB lost due to fragmentation)
  Productivity 47.8% of total user, 35.4% of total elapsed

newtype E = E Word8:

  11,692,768 bytes allocated in the heap
   8,909,632 bytes copied during GC
   2,779,776 bytes maximum residency (3 sample(s))
      92,464 bytes maximum slop
           7 MB total memory in use (0 MB lost due to fragmentation)
  Productivity 36.9% of total user, 33.8% of total elapsed

data S = S !D:

  10,892,736 bytes allocated in the heap
   6,401,260 bytes copied during GC
   1,396,092 bytes maximum residency (3 sample(s))
      55,940 bytes maximum slop
           6 MB total memory in use (0 MB lost due to fragmentation)
  Productivity 48.7% of total user, 33.3% of total elapsed

data T = T {-# UNPACK #-} !E:

  11,692,968 bytes allocated in the heap
   8,909,640 bytes copied during GC
   2,779,760 bytes maximum residency (3 sample(s))
      92,536 bytes maximum slop
           7 MB total memory in use (0 MB lost due to fragmentation)
  Productivity 36.1% of total user, 31.6% of total elapsed
A similar performance gain can be obtained in the two-constructor case.
The benchmark code (save it as EnumTest.hs):
{-# LANGUAGE CPP, MagicHash, BangPatterns, GeneralizedNewtypeDeriving #-}
module Main(main,d,e,s,t,D(..),E(..),S(..),T(..))
where
import GHC.Base
import Data.List
import Data.Word
import Control.DeepSeq
import Criterion.Main
data D = A | B | C deriving(Eq,Ord,Show,Enum,Bounded)
newtype E = E Word8 deriving(Eq,Ord,Show,Enum)
data S = S !D deriving (Eq,Ord,Show)
data T = T {-# UNPACK #-} !E deriving (Eq,Ord,Show)
-- I assume the following definitions are all correct --- otherwise
-- the whole benchmark may be useless
instance NFData D where
    rnf !x = ()
instance NFData E where
    rnf (E !x) = ()
instance NFData S where
    rnf (S !x) = ()
instance NFData T where
    rnf (T (E !x)) = ()
instance Enum S where
    toEnum = S . toEnum
    fromEnum (S x) = fromEnum x
instance Enum T where
    toEnum = T . toEnum
    fromEnum (T x) = fromEnum x
instance Bounded E where
    minBound = E 0
    maxBound = E 2
instance Bounded S where
    minBound = S minBound
    maxBound = S maxBound
instance Bounded T where
    minBound = T minBound
    maxBound = T maxBound
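-- Wrap-around successor: goes back to minBound at maxBound, so the heap test
-- below can iterate it indefinitely.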
succ' :: (Eq a, Enum a, Bounded a) => a -> a
succ' x | x == maxBound = minBound
        | otherwise     = succ x
-- Those numbers below are for easy browsing of the assembly code
d :: D -> Int#
d x = case x of
        A -> 1234#
        B -> 5678#
        C -> 9412#
e :: E -> Int#
e x = case x of
        E 0 -> 1357#
        E 1 -> 2468#
        E _ -> 9914#
s :: S -> Int#
s x = case x of
        S A -> 9876#
        S B -> 5432#
        S C -> 1097#
t :: T -> Int#
t x = case x of
        T (E 0) -> 9630#
        T (E 1) -> 8529#
        T (E _) -> 7418#
benchmark :: IO ()
benchmark = defaultMain [ bench "d" $ whnf d' A
                        , bench "e" $ whnf e' (E 0)
                        , bench "s" $ whnf s' (S A)
                        , bench "t" $ whnf t' (T (E 0))
                        ]
  where
    d' x = I# (d x)
    e' x = I# (e x)
    s' x = I# (s x)
    t' x = I# (t x)
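-- Force a list of 100000 iterated succ' values and print the last one
-- (via reverse), to observe how the values are represented on the heap.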
heapTest :: (NFData a,Show a,Eq a,Enum a,Bounded a) => a -> IO ()
heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x
main :: IO ()
main =
#if defined TEST_D
    heapTest (A :: D)
#elif defined TEST_E
    heapTest (E 0 :: E)
#elif defined TEST_S
    heapTest (S A :: S)
#elif defined TEST_T
    heapTest (T (E 0) :: T)
#else
    benchmark
#endif
-- A minor rant:
-- For reliable statistics, I hope Criterion will run the code in *random order*,
-- at least for comparing functions with the same type. Elapsed times on my system are just too
-- noisy to conclude anything.
The makefile used for the benchmark:
GHC=/usr/bin/ghc
# If you don't like the AT&T syntax in the output assembly, use this: -fllvm -optlc --x86-asm-syntax=intel
GHC_DEBUG_FLAGS= -keep-s-file -keep-llvm-file # -optlc --x86-asm-syntax=intel
GHCFLAGS=-O2 -funbox-strict-fields -rtsopts -fllvm -fwarn-missing-signatures
GHC_MAKE=$(GHC) --make $(GHCFLAGS)
GHC_PROF_MAKE=$(GHC) -prof -auto-all -caf-all --make $(GHCFLAGS)
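# 'make benchmark' runs the Criterion benchmarks; 'make enumtest_all' builds the
# four heap-test binaries and runs each of them with +RTS -sstderr.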
all : benchmark enumtest_all
enumtest_d : EnumTest.hs
	$(GHC_MAKE) -o $@ $^ -DTEST_D
enumtest_e : EnumTest.hs
	$(GHC_MAKE) -o $@ $^ -DTEST_E
enumtest_s : EnumTest.hs
	$(GHC_MAKE) -o $@ $^ -DTEST_S
enumtest_t : EnumTest.hs
	$(GHC_MAKE) -o $@ $^ -DTEST_T
enumtest_all : enumtest_d enumtest_e enumtest_s enumtest_t
	for x in $^; do ./$$x +RTS -sstderr ; done
benchmark : EnumTest
	time ./$^
% : %.hs
	$(GHC_MAKE) -o $@ $^
%.core : %.hs
	$(GHC) -S $(GHCFLAGS) $(GHC_DEBUG_FLAGS) -ddump-simpl -dsuppress-all -dsuppress-coercions -ddump-stranal $^ > $@
clean :
	rm *.hi *.o *.core *.s enumtest_? ; true
Thank you very much!