7
votes

I heard a lot about amazing performance of programs written in Haskell, and wanted to make some tests. So, I wrote a 'library' for matrix operations just to compare it's performance with the same stuff written in pure C. First of all I tested 500000 matrices multiplication performance, and noticed that it was... never-ending (i. e. ending with out of memory exception after 10 minutes of so)! After studying haskell a bit more I managed to get rid of laziness and the best result I managed to get is ~20 times slower than its equivalent in C. So, the question: could you review the code below and tell if its performance can be improved a bit more? 20 times is still disappointing me a bit.

import Prelude hiding (foldr, foldl, product)
import Data.Monoid
import Data.Foldable

import Text.Printf
import System.CPUTime

import System.Environment

data Vector a = Vec3 a a a
              | Vec4 a a a a
                deriving Show

instance Foldable Vector where
  foldMap f (Vec3 a b c) = f a `mappend` f b `mappend` f c
  foldMap f (Vec4 a b c d) = f a `mappend` f b `mappend` f c `mappend` f d

data Matr a = Matr !a !a !a !a
                   !a !a !a !a
                   !a !a !a !a
                   !a !a !a !a

instance Show a => Show (Matr a) where
  show m = foldr f [] $ matrRows m
            where f a b = show a ++ "\n" ++ b

matrCols (Matr a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3)
              = [Vec4 a0 a1 a2 a3, Vec4 b0 b1 b2 b3, Vec4 c0 c1 c2 c3, Vec4 d0 d1 d2 d3]

matrRows (Matr a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3)
              = [Vec4 a0 b0 c0 d0, Vec4 a1 b1 c1 d1, Vec4 a2 b2 c2 d2, Vec4 a3 b3 c3 d3]

matrFromList [a0, b0, c0, d0, a1, b1, c1, d1, a2, b2, c2, d2, a3, b3, c3, d3]
        = Matr a0 b0 c0 d0
               a1 b1 c1 d1
               a2 b2 c2 d2
               a3 b3 c3 d3

matrId :: Matr Double
matrId  = Matr 1 0 0 0
               0 1 0 0
               0 0 1 0
               0 0 0 1

normalise (Vec4 x y z w) = Vec4 (x/w) (y/w) (z/w) 1

mult a b = matrFromList [f r c | r <- matrRows a, c <- matrCols b] where 
            f a b = foldr (+) 0 $ zipWith (*) (toList a) (toList b)
2
Your matrix mult is changing representations between matrices, lists and back to matrices. Whilst this does allow you to do the the computation in two lines of code, it will be very, very slow.stephen tetley
@stephen tetley: it ain't necessarily so. The compilers are rather good at inlining and deforestation these days, so a smart compiler would elide the intermediate lists. In my tests this implementation, compared to a "fast" hand-written unrolled multiplication, is not terribly slow (just about 1.5 times slower). UPD: my mistake, actually 6 times slower for Double matrices.n. 1.8e9-where's-my-share m.
At the risk of being labelled a troll, I don't think you'll find results like you're looking for with these tests. Generally the best Haskell will give performance on par with good C. The big advantages of Haskell come from readability, more expressivity compared to C, and advanced libraries such as STM and parallel, which make adding parallelism very easy compared to most other languages. For pure numerical work, it's quite difficult for Haskell performance to beat C implementations.John L
@John L: The biggest performance win with Haskell, to my mind, is how often you can just write the simple, naive, straightforward version of something and have GHC optimize it to give good (but not great) performance.C. A. McCann
@Maxym - I was intending that you do the matrix multiplication "by hand" by pattern matching on the constructors then chaining the addition and multiplication that calculates each element as per (a*m+b*p+c*s). Of course, the code is long winded for the 4x4 matrix. Lists are simply "wrong" for matrices as indexing is slow, they have no bounds, etc.stephen tetley

2 Answers

8
votes

First, I doubt that you'll ever get stellar performance with this implementation. There are too many conversions between different representations. You'd be better off basing your code on something like the vector package. Also you don't provide all your testing code, so there are probably other issues that we can't here. This is because the pipeline of production to consumption has a big impact on Haskell performance, and you haven't provided either end.

Now, two specific problems:

1) Your vector is defined as either a 3 or 4 element vector. This means that for every vector there's an extra check to see how many elements are in use. In C, I imagine your implementation is probably closer to

struct vec {
  double *vec;
  int length;
}

You should do something similar in Haskell; this is how vector and bytestring are implemented for example.

Even if you don't change the Vector definition, make the fields strict. You should also either add UNPACK pragmas (to Vector and Matrix) or compile with -funbox-strict-fields.

2) Change mult to

mult a b = matrFromList [f r c | r <- matrRows a, c <- matrCols b] where 
            f a b = Data.List.foldl' (+) 0 $ zipWith (*) (toList a) (toList b)

The extra strictness of foldl' will give much better performance in this case than foldr.

This change alone might make a big difference, but without seeing the rest of your code it's difficult to say.

4
votes

Answering my own question just to share new results I got yesterday:

  1. I upgraded ghc to the most recent version and performance became indeed not that bad (only ~7 times worse).

  2. Also I tried implementing the matrix in a stupid and simple way (see the listing below) and got really acceptable performance - only about 2 times slower than C equivalent.

    data Matr a = Matr ( a, a, a, a
                       , a, a, a, a
                       , a, a, a, a
                       , a, a, a, a) 
    
    mult (Matr (!a0,  !b0,  !c0,  !d0,  
                !a1,  !b1,  !c1,  !d1,  
                !a2,  !b2,  !c2,  !d2,  
                !a3,  !b3,  !c3,  !d3))
         (Matr (!a0', !b0', !c0', !d0', 
                !a1', !b1', !c1', !d1', 
                !a2', !b2', !c2', !d2', 
                !a3', !b3', !c3', !d3'))
         = Matr ( a0'', b0'', c0'', d0''
                , a1'', b1'', c1'', d1''
                , a2'', b2'', c2'', d2''
                , a3'', b3'', c3'', d3'')
            where a0'' = a0 * a0' + b0 * a1' + c0 * a2' + d0 * a3'
                  b0'' = a0 * b0' + b0 * b1' + c0 * b2' + d0 * b3'
                  c0'' = a0 * c0' + b0 * c1' + c0 * c2' + d0 * c3'
                  d0'' = a0 * d0' + b0 * d1' + c0 * d2' + d0 * d3'
    
                  a1'' = a1 * a0' + b1 * a1' + c1 * a2' + d1 * a3'
                  b1'' = a1 * b0' + b1 * b1' + c1 * b2' + d1 * b3'
                  c1'' = a1 * c0' + b1 * c1' + c1 * c2' + d1 * c3'
                  d1'' = a1 * d0' + b1 * d1' + c1 * d2' + d1 * d3'
    
                  a2'' = a2 * a0' + b2 * a1' + c2 * a2' + d2 * a3'
                  b2'' = a2 * b0' + b2 * b1' + c2 * b2' + d2 * b3'
                  c2'' = a2 * c0' + b2 * c1' + c2 * c2' + d2 * c3'
                  d2'' = a2 * d0' + b2 * d1' + c2 * d2' + d2 * d3'
    
                  a3'' = a3 * a0' + b3 * a1' + c3 * a2' + d3 * a3'
                  b3'' = a3 * b0' + b3 * b1' + c3 * b2' + d3 * b3'
                  c3'' = a3 * c0' + b3 * c1' + c3 * c2' + d3 * c3'
                  d3'' = a3 * d0' + b3 * d1' + c3 * d2' + d3 * d3'