13
votes

In a comment on the previous question, I claimed:

I have another benchmark to indicate ghc-7.4.1+llvm does unpacking of enumerations on strict data fields.

In fact, after some experimentation, I believe that, in at least some simplistic cases, using an enumeration is at least as fast as, and actually might be more memory efficient (hence faster in a more realistic application) than using a newtype of Word8 (or Int), even when used in a strict field of a datatype. As I said in the previous question, I have experienced the similar phenomenon in a more realistic (but still small) setting.

Can anyone point me to some relevant references as to what optimizations ghc/llvm does for enumerations? In particular, does it really unpack the internal tag of the enumeration on a strict data field? The assembly output and the result of profiling seem to suggest that that is the case, but to me there is no indication on the core level. Any insights will be greatly appreciated.

One more question: Are enumerations always at least as efficient as newtypes of the corresponding Integral, where using them makes sense? (Note that enumerations can behave like Integrals,too.) If not, what is the (hopefully realistically useful) exception? Daniel Fischer suggested in his answer that putting an enumeration on a strict field of a multiple-constructor datatype might prevent certain optimizations. However I have failed to verify this in the two-constructor case. Maybe there is a difference when putting them in a large multiple-constructor datatypes?

I'm also curious about what exactly is happening in the following benchmark. In all the four cases, roughly the same amount of bytes were allocated in the heap. However, for enumerations, the GC actually did less copying and the maximum residencies were smaller compared to newtypes.

(Actually my real question was, is it worthwhile to to try and convert enumerations to newtypes when performance matters? , but I assumed that being a bit more specific might be more helpful.)

A possible implication of this question would be this: If you are using a large number of Int's in your program which actually varies on some very small subset, then changing them to an enumeration (rather than to an unboxed type!) might be a performance gain (but careful for strictness).

Here is the summary of the benchmark, followed by the benchmark code and the convenience makefile for testing on your system.

benchmarking d
mean: 11.09113 ns, lb 11.06140 ns, ub 11.17545 ns, ci 0.950
std dev: 234.6722 ps, lb 72.31532 ps, ub 490.1156 ps, ci 0.950

benchmarking e
mean: 11.54242 ns, lb 11.51789 ns, ub 11.59720 ns, ci 0.950
std dev: 178.8556 ps, lb 73.05290 ps, ub 309.0252 ps, ci 0.950

benchmarking s
mean: 11.74964 ns, lb 11.52543 ns, ub 12.50447 ns, ci 0.950
std dev: 1.803095 ns, lb 207.2720 ps, ub 4.029809 ns, ci 0.950

benchmarking t
mean: 11.89797 ns, lb 11.86266 ns, ub 11.99105 ns, ci 0.950
std dev: 269.5795 ps, lb 81.65093 ps, ub 533.8658 ps, ci 0.950

OK,so the enumeration appears at least no less efficient than the newtype
Next,heap profiles of the function
heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x

data    D = A | B | C:
      10,892,604 bytes allocated in the heap
       6,401,260 bytes copied during GC
       1,396,092 bytes maximum residency (3 sample(s))
          55,940 bytes maximum slop
               6 MB total memory in use (0 MB lost due to fragmentation)
  Productivity  47.8% of total user, 35.4% of total elapsed

newtype E = E Word8:
      11,692,768 bytes allocated in the heap
       8,909,632 bytes copied during GC
       2,779,776 bytes maximum residency (3 sample(s))
          92,464 bytes maximum slop
               7 MB total memory in use (0 MB lost due to fragmentation)
  Productivity  36.9% of total user, 33.8% of total elapsed

data  S = S !D:
      10,892,736 bytes allocated in the heap
       6,401,260 bytes copied during GC
       1,396,092 bytes maximum residency (3 sample(s))
          55,940 bytes maximum slop
               6 MB total memory in use (0 MB lost due to fragmentation)
  Productivity  48.7% of total user, 33.3% of total elapsed

data  T = T {-# UNPACK #-} !E:
      11,692,968 bytes allocated in the heap
       8,909,640 bytes copied during GC
       2,779,760 bytes maximum residency (3 sample(s))
          92,536 bytes maximum slop
               7 MB total memory in use (0 MB lost due to fragmentation)
  Productivity  36.1% of total user, 31.6% of total elapsed

A similar performance gain can be obtained in the two-constructor case.

The benchmark code(save it as EnumTest.hs):


{-# LANGUAGE CPP,MagicHash , BangPatterns ,GeneralizedNewtypeDeriving #-}
module Main(main,d,e,s,t,D(..),E(..),S(..),T(..))
where       
import GHC.Base  
import Data.List
import Data.Word
import Control.DeepSeq
import Criterion.Main

data    D = A | B | C  deriving(Eq,Ord,Show,Enum,Bounded)
newtype E = E Word8    deriving(Eq,Ord,Show,Enum)

data    S = S                !D deriving (Eq,Ord,Show) 
data    T = T {-# UNPACK #-} !E deriving (Eq,Ord,Show)

-- I assume the following definitions are all correct --- otherwise
-- the whole benchmark may be useless
instance NFData D where
  rnf !x         = ()
instance NFData E where
  rnf (E !x)     = ()
instance NFData S where
  rnf (S !x)     = ()
instance NFData T where
  rnf (T (E !x)) = ()  

instance Enum S where
  toEnum         = S . toEnum
  fromEnum (S x) = fromEnum x 
instance Enum T where
  toEnum         = T . toEnum
  fromEnum (T x) = fromEnum x 

instance Bounded E where
  minBound = E 0
  maxBound = E 2
instance Bounded S where
  minBound = S minBound
  maxBound = S maxBound
instance Bounded T where
  minBound = T minBound
  maxBound = T maxBound

succ' :: (Eq a,Enum a,Bounded a) => a -> a
succ' x | x == maxBound = minBound
          | otherwise        = succ x

-- Those numbers below are for easy browsing of the assembly code
d :: D -> Int#
d x = case x of
  A -> 1234#
  B -> 5678#
  C -> 9412#

e :: E -> Int#
e x = case x of     
  E 0 -> 1357#
  E 1 -> 2468#
  E _ -> 9914#

s :: S -> Int#
s x = case x of     
  S A -> 9876#
  S B -> 5432#
  S C -> 1097#

t :: T -> Int#
t x = case x of     
  T (E 0) -> 9630#
  T (E 1) -> 8529#
  T (E _) -> 7418#


benchmark :: IO ()
benchmark = defaultMain [ bench "d" $ whnf d' A
                        , bench "e" $ whnf e' (E 0)
                        , bench "s" $ whnf s' (S A) 
                        , bench "t" $ whnf t' (T (E 0))
                        ]
  where
    d' x = I# (d x)
    e' x = I# (e x)
    s' x = I# (s x)
    t' x = I# (t x)    

heapTest :: (NFData a,Show a,Eq a,Enum a,Bounded a) => a -> IO ()
heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x

main :: IO ()
main =    
#if   defined TEST_D
     heapTest (A :: D)
#elif defined TEST_E                
     heapTest (E 0 :: E)
#elif defined TEST_S
     heapTest (S A :: S)
#elif defined TEST_T                
     heapTest (T (E 0) :: T)
#else
     benchmark
#endif

-- A minor rant: 
-- For reliable statistics, I hope Criterion will run the code in *random order*,
-- at least for comparing functions with the same type. Elapsed times on my system are just too
-- noisy to conclude anything.

The makefile used for the benchmark:


GHC=/usr/bin/ghc
# If you dont't like the ATT syntax in the output assembly, use this: -fllvm -optlc --x86-asm-syntax=intel
GHC_DEBUG_FLAGS= -keep-s-file -keep-llvm-file  # -optlc --x86-asm-syntax=intel
GHCFLAGS=-O2 -funbox-strict-fields -rtsopts -fllvm -fwarn-missing-signatures
GHC_MAKE=$(GHC) --make $(GHCFLAGS)
GHC_PROF_MAKE=$(GHC)  -prof  -auto-all -caf-all --make $(GHCFLAGS)

all : benchmark enumtest_all

enumtest_d : EnumTest.hs
    $(GHC_MAKE) -o $@ $^ -DTEST_D

enumtest_e : EnumTest.hs
    $(GHC_MAKE) -o $@ $^ -DTEST_E

enumtest_s : EnumTest.hs
    $(GHC_MAKE) -o $@ $^ -DTEST_S

enumtest_t : EnumTest.hs
    $(GHC_MAKE) -o $@ $^ -DTEST_T

enumtest_all : enumtest_d enumtest_e enumtest_s enumtest_t
    for x in $^; do ./$$x +RTS -sstderr ;done

benchmark : EnumTest
    time ./$^

% : %.hs
    $(GHC_MAKE) -o $@ $^

%.core : %.hs
    $(GHC)  -S  $(GHCFLAGS)   $(GHC_DEBUG_FLAGS) -ddump-simpl -dsuppress-all -dsuppress-coercions -ddump-stranal $^ > $@

clean :
    rm *.hi *.o *.core *.s enumtest_? ; true

Thank you very much!

2

2 Answers

8
votes

First

Daniel Fischer suggested in his answer that putting an enumeration on a strict field of a multiple-constructor datatype might prevent certain optimizations.

you misunderstood that. If you have a constructor C of a type, doesn't matter whether that type has multiple constructors or just one, with a strict field of type T, C ... !T ..., then the strict field can be unpacked if T is a newtype wrapper around an unpackable type, but not if T is an enumeration. It should in principle be possible to unpack the constructor tag of the value of type T too, but GHC just doesn't do that (there's probably a reason for that, it may be more complicated than I see). Nevertheless, for small enough enumeration types, the pointer tagging mentioned by Mikhail Gushenkov should have more or less the same effect (not quite, perhaps).

But for enumeration types with more than 3 or 7 (for 64-bit words) constructors, where following the pointer is necessary in some cases, the difference should manifest.

Second, the short answer to

Actually my real question was, is it worthwhile to to try and convert enumerations to newtypes when performance matters?

Sometimes.

Whether a conversion would actually improve performance, and by how much, depends on what you are doing with the values. It could also make your programme slower. And it may use more memory (see below).

There is no general rule, each case needs to be evaluated. There are patterns where newtype-wrapped Ints are faster, there are patterns where they are slower. A typical programme would contain instances of both, and one must find out which dominate.


Now to the benchmark.

I took the liberty of changing the arguments in the benchmark, using C instead of A and E 2 instead of E 0. The results were that the newtype was slightly faster for these:

warming up
estimating clock resolution...
mean is 1.549612 us (640001 iterations)
found 4506 outliers among 639999 samples (0.7%)
  3639 (0.6%) high severe
estimating cost of a clock call...
mean is 39.24624 ns (12 iterations)
found 2 outliers among 12 samples (16.7%)
  1 (8.3%) low mild
  1 (8.3%) high severe

benchmarking d
mean: 12.12989 ns, lb 12.01136 ns, ub 12.32002 ns, ci 0.950
std dev: 755.9999 ps, lb 529.5348 ps, ub 1.034185 ns, ci 0.950
found 17 outliers among 100 samples (17.0%)
  17 (17.0%) high severe
variance introduced by outliers: 59.503%
variance is severely inflated by outliers

benchmarking e
mean: 10.82692 ns, lb 10.73286 ns, ub 10.98045 ns, ci 0.950
std dev: 604.1786 ps, lb 416.5018 ps, ub 871.0923 ps, ci 0.950
found 10 outliers among 100 samples (10.0%)
  4 (4.0%) high mild
  6 (6.0%) high severe
variance introduced by outliers: 53.482%
variance is severely inflated by outliers

benchmarking s
mean: 13.18192 ns, lb 13.11898 ns, ub 13.25911 ns, ci 0.950
std dev: 354.1332 ps, lb 300.2860 ps, ub 406.2424 ps, ci 0.950
found 13 outliers among 100 samples (13.0%)
  13 (13.0%) high mild
variance introduced by outliers: 20.952%
variance is moderately inflated by outliers

benchmarking t
mean: 11.16461 ns, lb 11.02716 ns, ub 11.37018 ns, ci 0.950
std dev: 853.2152 ps, lb 602.5197 ps, ub 1.086899 ns, ci 0.950
found 14 outliers among 100 samples (14.0%)
  3 (3.0%) high mild
  11 (11.0%) high severe
variance introduced by outliers: 68.689%
variance is severely inflated by outliers

So the benchmark shows no substantial difference in either case, and the overall outcome depends on the supplied arguments. With B resp. E 1, the difference was smaller, but in my run, the newtype won there too.

Note however, that the estimated cost of a clock call is about four times as large as the mean for any of these, and the estimated clock resolution more than 100 times. I am not convinced that these results are reliable. Considering the variance I observe on my system for short benchmarks, I do not trust benchmarks for anything that runs less than 10 microseconds. I prefer tests running longer because the results are more stable and fewer outliers are produced.

Regarding

heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x

unfortunately, iterate (force . succ') doesn't force the elements of the list as it is constructed, so you get a list of thunks (of increasing depth), reverse the initial segment of that, and then force the list elements.

So the overwhelming part of work and allocation done in that is the construction of thunks and the list, and then the evaluation of the thunks. You get a more meaningful result if you prevent the building of large thunks by forcing the elements as they are generated,

iterate' :: (a -> a) -> a -> [a]
iterate' f !a = a : iterate' f (f a)

(a bang pattern - WHNF - is enough to completely evaluate values of the types in question).

Still, there is the observable and consistent difference in the allocation figures between the enumeration and the newtype variants, and that remains also with iterate'. And also if instead of reversing an initial segment and taking the head of that, one simply takes list !! index, where it becomes even more impressive because the other figures (copied bytes, maximum residency) then are small.

The reason is that list elements can't be unboxed, so the list cells contain pointers to the element values, and the enumeration values are shared (there exists only one A in the entire programme), so all these pointers point to the same object, but integers aren't shared, so each list cell points to a different object.

The difference in allocation is on your system almost exactly 800,000 bytes, on mine 1,600,000.

That is what 200,000 words need, what allocation of 100,000 Word8 (or Word; Int, ...) requires (per value, one word for the constructor, and one for a Word# or Int#).

7
votes

Without looking at the compiler output, I think that the absence of speed increase in the newtype version of your code may be due to pointer tagging. On x86, GHC reserves 2 bits in each pointer for information about the pointed-to closure. 00 means "unevaluated or unknown", and the other 3 cases encode the actual tag of the evaluated constructor. This information is updated dynamically by the garbage collector. Since you have only 3 cases in your test data type, they always fit in the tag bits, and therefore pattern-matching never requires an indirection. Try adding more cases to your data type, and see what happens. You can find more information about dynamic pointer tagging in this paper:

Faster laziness using dynamic pointer tagging

Simon Marlow, Alexey Rodriguez Yakushev, and Simon Peyton Jones, ICFP 2007.