Reasonably efficient pure-functional matrix product in Haskell?

Question

I know Haskell a little bit, and I wonder if it's possible to write something like a matrix-matrix product in Haskell that is all of the following:

Pure-functional: no IO or State monads in its type signature (I don't care what happens in the function body. That is, I don't care if the function body uses monads, as long as the whole function is pure). I may want to use this matrix-matrix product in a pure function.
Memory-safe: no malloc or pointers. I know that it's possible to "write C" in Haskell, but you lose memory safety. Actually writing this code in C and interfacing it with Haskell also loses memory safety.
As efficient as, say, Java. For concreteness, let's assume I'm talking about a simple triple loop, single precision, contiguous column-major layout (float[], not float[][]) and matrices of size 1000x1000, and a single-core CPU. (If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark.)

(I don't want this to sound like a challenge, but note that Java can satisfy all of the above easily.)

I already know that

The triple loop implementation is not the most efficient one. It's quite cache-oblivious. It's better to use a well-written BLAS implementation in this particular case. However, one can not always count on a C library being available for what one is trying to do. I wonder if reasonably efficient code can be written in normal Haskell.
Some people wrote whole research papers that demonstrate #3. However, I'm not a computer science researcher. I wonder if it's possible to keep simple things simple in Haskell.
The Gentle Introduction to Haskell has a matrix product implementation. It wouldn't satisfy the above requirements though.

Addressing comments:

I have three reasons: first, the "no malloc or pointers" requirement is as yet ill-defined (I challenge you to write any piece of Haskell code which uses no pointers);

I saw plenty of Haskell programs not using Ptr. Perhaps it refers to the fact that at the machine instruction level, pointers will be used? That's not what I meant. I was referring to the abstraction level of the Haskell source code.

second, the attack on CS research is out of place (and furthermore I can't imagine anything simpler than using code somebody else has already written for you); third, there are many matrix packages on Hackage (and the prep work for asking this question should include reviewing and rejecting each).

It seems that your #2 and #3 are the same ("use existing libraries"). I'm interested in the matrix product as a simple test of what Haskell can do on its own, and whether it allows you to "keep simple things simple". I could have easily come up with a numerical problem that doesn't have any ready libraries, but then I'd have to explain the problem, whereas everyone already knows what a matrix product is.

How can Java possibly satisfy 1.? Any Java method is essentially :: IORef Arg -> ... -> IORef This -> IO Ret

This goes to the root of my question, actually (+1). While Java does not claim to track purity, Haskell does. In Java, whether the function is pure or not is indicated in the comments. I can claim that the matrix product is pure, even though I do mutation in the function body. The question is whether Haskell's approach (purity encoded in the type system) is compatible with efficiency, memory-safety and simplicity.

You say "I don't care what happens in the function body" and you say "no malloc or pointers". Which one is it? Do you want to clarify those points? — Dietrich Epp
How can Java possibly satisfy 1.? Any Java method is essentially :: IORef Arg -> ... -> IORef This -> IO Ret. — leftaroundabout
This question is clear and on-topic (and so I think shouldn't be closed); however, I'm downvoting. I have three reasons: first, the "no malloc or pointers" requirement is as yet ill-defined (I challenge you to write any piece of Haskell code which uses no pointers); second, the attack on CS research is out of place (and furthermore I can't imagine anything simpler than using code somebody else has already written for you); third, there are many matrix packages on Hackage (and the prep work for asking this question should include reviewing and rejecting each). — Daniel Wagner
@DanielWagner: "How can one implement fast matrix multiplication in a pure functional style while maintaining simplicity?" is an exceptional question with wide applicability that deserves up-votes. It is clear this is what OP was asking. I can not understand why anyone would down-vote this question. — Philip JF
There's too much baggage in this question about programming paradigms, so its just going to elicit long, non-specific responses. — Don Stewart

Daniel Fischer Daniel Fischer · Accepted Answer · 2012-08-02T12:01:55

As efficient as, say, Java. For concreteness, let's assume I'm talking about a simple triple loop, single precision, contiguous column-major layout (float[], not float[][]) and matrices of size 1000x1000, and a single-core CPU. (If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark)

So something like

public class MatrixProd {
    static float[] matProd(float[] a, int ra, int ca, float[] b, int rb, int cb) {
        if (ca != rb) {
            throw new IllegalArgumentException("Matrices not fitting");
        }
        float[] c = new float[ra*cb];
        for(int i = 0; i < ra; ++i) {
            for(int j = 0; j < cb; ++j) {
                float sum = 0;
                for(int k = 0; k < ca; ++k) {
                    sum += a[i*ca+k]*b[k*cb+j];
                }
                c[i*cb+j] = sum;
            }
        }
        return c;
    }

    static float[] mkMat(int rs, int cs, float x, float d) {
        float[] arr = new float[rs*cs];
        for(int i = 0; i < rs; ++i) {
            for(int j = 0; j < cs; ++j) {
                arr[i*cs+j] = x;
                x += d;
            }
        }
        return arr;
    }

    public static void main(String[] args) {
        int sz = 100;
        float strt = -32, del = 0.0625f;
        if (args.length > 0) {
            sz = Integer.parseInt(args[0]);
        }
        if (args.length > 1) {
            strt = Float.parseFloat(args[1]);
        }
        if (args.length > 2) {
            del = Float.parseFloat(args[2]);
        }
        float[] a = mkMat(sz,sz,strt,del);
        float[] b = mkMat(sz,sz,strt-16,del);
        System.out.println(a[sz*sz-1]);
        System.out.println(b[sz*sz-1]);
        long t0 = System.currentTimeMillis();
        float[] c = matProd(a,sz,sz,b,sz,sz);
        System.out.println(c[sz*sz-1]);
        long t1 = System.currentTimeMillis();
        double dur = (t1-t0)*1e-3;
        System.out.println(dur);
    }
}

I suppose? (I hadn't read the specs properly before coding, so the layout is row-major, but since the access pattern is the same, that doesn't make a difference as mixing layouts would, so I'll assume that's okay.)

I haven't spent any time on thinking about a clever algorithm or low-level optimisation tricks (I wouldn't achieve much in Java with those anyway). I just wrote the simple loop, because

I don't want this to sound like a challenge, but note that Java can satisfy all of the above easily

And that's what Java gives easily, so I'll take that.

(If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark)

Nowhere near, I'm afraid, neither in Java nor in Haskell. Too many cache misses to reach that throughput with the simple triple loop.

Doing the same in Haskell, again no thinking about being clever, a plain straightforward triple loop:

{-# LANGUAGE BangPatterns #-}
module MatProd where

import Data.Array.ST
import Data.Array.Unboxed

matProd :: UArray Int Float -> Int -> Int -> UArray Int Float -> Int -> Int -> UArray Int Float
matProd a ra ca b rb cb =
    let (al,ah)     = bounds a
        (bl,bh)     = bounds b
        {-# INLINE getA #-}
        getA i j    = a!(i*ca + j)
        {-# INLINE getB #-}
        getB i j    = b!(i*cb + j)
        {-# INLINE idx #-}
        idx i j     = i*cb + j
    in if al /= 0 || ah+1 /= ra*ca || bl /= 0 || bh+1 /= rb*cb || ca /= rb
         then error $ "Matrices not fitting: " ++ show (ra,ca,al,ah,rb,cb,bl,bh)
         else runSTUArray $ do
            arr <- newArray (0,ra*cb-1) 0
            let outer i j
                    | ra <= i   = return arr
                    | cb <= j   = outer (i+1) 0
                    | otherwise = do
                        !x <- inner i j 0 0
                        writeArray arr (idx i j) x
                        outer i (j+1)
                inner i j k !y
                    | ca <= k   = return y
                    | otherwise = inner i j (k+1) (y + getA i k * getB k j)
            outer 0 0

mkMat :: Int -> Int -> Float -> Float -> UArray Int Float
mkMat rs cs x d = runSTUArray $ do
    let !r = rs - 1
        !c = cs - 1
        {-# INLINE idx #-}
        idx i j = cs*i + j
    arr <- newArray (0,rs*cs-1) 0
    let outer i j y
            | r < i     = return arr
            | c < j     = outer (i+1) 0 y
            | otherwise = do
                writeArray arr (idx i j) y
                outer i (j+1) (y + d)
    outer 0 0 x

and the calling module

module Main (main) where

import System.Environment (getArgs)
import Data.Array.Unboxed

import System.CPUTime
import Text.Printf

import MatProd

main :: IO ()
main = do
    args <- getArgs
    let (sz, strt, del) = case args of
                            (a:b:c:_) -> (read a, read b, read c)
                            (a:b:_)   -> (read a, read b, 0.0625)
                            (a:_)     -> (read a, -32, 0.0625)
                            _         -> (100, -32, 0.0625)
        a = mkMat sz sz strt del
        b = mkMat sz sz (strt - 16) del
    print (a!(sz*sz-1))
    print (b!(sz*sz-1))
    t0 <- getCPUTime
    let c = matProd a sz sz b sz sz
    print $ c!(sz*sz-1)
    t1 <- getCPUTime
    printf "%.6f\n" (fromInteger (t1-t0)*1e-12 :: Double)

So we're doing almost exactly the same things in both languages. Compile the Haskell with -O2, the Java with javac

$ java MatrixProd 1000 "-13.7" 0.013
12915.623
12899.999
8.3592897E10
8.193
$ ./vmmult 1000 "-13.7" 0.013
12915.623
12899.999
8.35929e10
8.558699

And the resulting times are quite close.

And if we compile the Java code to native, with gcj -O3 -Wall -Wextra --main=MatrixProd -fno-bounds-check -fno-store-check -o jmatProd MatrixProd.java,

$ ./jmatProd 1000 "-13.7" 0.013
12915.623
12899.999
8.3592896512E10
8.215

there's still no big difference.

As a special bonus, the same algorithm in C (gcc -O3):

$ ./cmatProd 1000 "-13.7" 0.013
12915.623047
12899.999023
8.35929e+10
8.079759

So this reveals no fundamental difference between straightforward Java and straightforward Haskell when it comes to computationally intensive tasks using floating point numbers (when dealing with integer arithmetic on medium to large numbers, the use of GMP by GHC makes Haskell outperform Java's BigInteger by a huge margin for many tasks, but that is of course a library issue, not a language one), and both are close to C with this algorithm.

In all fairness, though, that is because the access pattern causes a cache-miss every other nanosecond, so in all three languages this computation is memory-bound.

If we improve the access pattern by multiplying a row-major matrix with a column-major matrix, all become faster, the gcc-compiled C finishes it 1.18s, java takes 1.23s and the ghc-compiled Haskell takes around 5.8s, which can be reduced to 3 seconds by using the llvm backend.

Here, the range-check by the array library really hurts. Using the unchecked array access (as one should, after checking for bugs, since the checks are already done in the code controlling the loops), GHC's native backend finishes in 2.4s, going via the llvm backend lets the computation finish in 1.55s, which is decent, although significantly slower than both C and Java. Using the primitives from GHC.Prim instead of the array library, the llvm backend produces code that runs in 1.16s (again, without bounds-checking on each access, but that only valid indices are produced during the computation can in this case easily be proved before, so here, no memory-safety is sacrificed¹; checking each access brings the time up to 1.96s, still significantly better than the bounds checking of the array library).

Bottom line: GHC needs (much) faster branching for the bounds-checking, and there's room for improvement in the optimiser, but in principle, "Haskell's approach (purity encoded in the type system) is compatible with efficiency, memory-safety and simplicity", we're just not yet there. For the time being, one has to decide how much of which point one is willing to sacrifice.

¹ Yes, that's a special case, in general omitting the bounds-check does sacrifice memory-safety, or it is at least harder to prove that it doesn't.

Reasonably efficient pure-functional matrix product in Haskell?

3 Answers

Bit of history

Back to Haskell

About purity

Final answer