As efficient as, say, Java. For concreteness, let's assume I'm talking about a simple triple loop, single precision, contiguous column-major layout (float[], not float[][]) and matrices of size 1000x1000, and a single-core CPU. (If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark)
So something like
public class MatrixProd {
static float[] matProd(float[] a, int ra, int ca, float[] b, int rb, int cb) {
if (ca != rb) {
throw new IllegalArgumentException("Matrices not fitting");
}
float[] c = new float[ra*cb];
for(int i = 0; i < ra; ++i) {
for(int j = 0; j < cb; ++j) {
float sum = 0;
for(int k = 0; k < ca; ++k) {
sum += a[i*ca+k]*b[k*cb+j];
}
c[i*cb+j] = sum;
}
}
return c;
}
static float[] mkMat(int rs, int cs, float x, float d) {
float[] arr = new float[rs*cs];
for(int i = 0; i < rs; ++i) {
for(int j = 0; j < cs; ++j) {
arr[i*cs+j] = x;
x += d;
}
}
return arr;
}
public static void main(String[] args) {
int sz = 100;
float strt = -32, del = 0.0625f;
if (args.length > 0) {
sz = Integer.parseInt(args[0]);
}
if (args.length > 1) {
strt = Float.parseFloat(args[1]);
}
if (args.length > 2) {
del = Float.parseFloat(args[2]);
}
float[] a = mkMat(sz,sz,strt,del);
float[] b = mkMat(sz,sz,strt-16,del);
System.out.println(a[sz*sz-1]);
System.out.println(b[sz*sz-1]);
long t0 = System.currentTimeMillis();
float[] c = matProd(a,sz,sz,b,sz,sz);
System.out.println(c[sz*sz-1]);
long t1 = System.currentTimeMillis();
double dur = (t1-t0)*1e-3;
System.out.println(dur);
}
}
I suppose? (I hadn't read the specs properly before coding, so the layout is row-major, but since the access pattern is the same, that doesn't make a difference as mixing layouts would, so I'll assume that's okay.)
I haven't spent any time on thinking about a clever algorithm or low-level optimisation tricks (I wouldn't achieve much in Java with those anyway). I just wrote the simple loop, because
I don't want this to sound like a challenge, but note that Java can satisfy all of the above easily
And that's what Java gives easily, so I'll take that.
(If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark)
Nowhere near, I'm afraid, neither in Java nor in Haskell. Too many cache misses to reach that throughput with the simple triple loop.
Doing the same in Haskell, again no thinking about being clever, a plain straightforward triple loop:
{-# LANGUAGE BangPatterns #-}
module MatProd where
import Data.Array.ST
import Data.Array.Unboxed
matProd :: UArray Int Float -> Int -> Int -> UArray Int Float -> Int -> Int -> UArray Int Float
matProd a ra ca b rb cb =
let (al,ah) = bounds a
(bl,bh) = bounds b
{-# INLINE getA #-}
getA i j = a!(i*ca + j)
{-# INLINE getB #-}
getB i j = b!(i*cb + j)
{-# INLINE idx #-}
idx i j = i*cb + j
in if al /= 0 || ah+1 /= ra*ca || bl /= 0 || bh+1 /= rb*cb || ca /= rb
then error $ "Matrices not fitting: " ++ show (ra,ca,al,ah,rb,cb,bl,bh)
else runSTUArray $ do
arr <- newArray (0,ra*cb-1) 0
let outer i j
| ra <= i = return arr
| cb <= j = outer (i+1) 0
| otherwise = do
!x <- inner i j 0 0
writeArray arr (idx i j) x
outer i (j+1)
inner i j k !y
| ca <= k = return y
| otherwise = inner i j (k+1) (y + getA i k * getB k j)
outer 0 0
mkMat :: Int -> Int -> Float -> Float -> UArray Int Float
mkMat rs cs x d = runSTUArray $ do
let !r = rs - 1
!c = cs - 1
{-# INLINE idx #-}
idx i j = cs*i + j
arr <- newArray (0,rs*cs-1) 0
let outer i j y
| r < i = return arr
| c < j = outer (i+1) 0 y
| otherwise = do
writeArray arr (idx i j) y
outer i (j+1) (y + d)
outer 0 0 x
and the calling module
module Main (main) where
import System.Environment (getArgs)
import Data.Array.Unboxed
import System.CPUTime
import Text.Printf
import MatProd
main :: IO ()
main = do
args <- getArgs
let (sz, strt, del) = case args of
(a:b:c:_) -> (read a, read b, read c)
(a:b:_) -> (read a, read b, 0.0625)
(a:_) -> (read a, -32, 0.0625)
_ -> (100, -32, 0.0625)
a = mkMat sz sz strt del
b = mkMat sz sz (strt - 16) del
print (a!(sz*sz-1))
print (b!(sz*sz-1))
t0 <- getCPUTime
let c = matProd a sz sz b sz sz
print $ c!(sz*sz-1)
t1 <- getCPUTime
printf "%.6f\n" (fromInteger (t1-t0)*1e-12 :: Double)
So we're doing almost exactly the same things in both languages. Compile the Haskell with -O2
, the Java with javac
$ java MatrixProd 1000 "-13.7" 0.013
12915.623
12899.999
8.3592897E10
8.193
$ ./vmmult 1000 "-13.7" 0.013
12915.623
12899.999
8.35929e10
8.558699
And the resulting times are quite close.
And if we compile the Java code to native, with gcj -O3 -Wall -Wextra --main=MatrixProd -fno-bounds-check -fno-store-check -o jmatProd MatrixProd.java
,
$ ./jmatProd 1000 "-13.7" 0.013
12915.623
12899.999
8.3592896512E10
8.215
there's still no big difference.
As a special bonus, the same algorithm in C (gcc -O3):
$ ./cmatProd 1000 "-13.7" 0.013
12915.623047
12899.999023
8.35929e+10
8.079759
So this reveals no fundamental difference between straightforward Java and straightforward Haskell when it comes to computationally intensive tasks using floating point numbers (when dealing with integer arithmetic on medium to large numbers, the use of GMP by GHC makes Haskell outperform Java's BigInteger by a huge margin for many tasks, but that is of course a library issue, not a language one), and both are close to C with this algorithm.
In all fairness, though, that is because the access pattern causes a cache-miss every other nanosecond, so in all three languages this computation is memory-bound.
If we improve the access pattern by multiplying a row-major matrix with a column-major matrix, all become faster, the gcc-compiled C finishes it 1.18s, java takes 1.23s and the ghc-compiled Haskell takes around 5.8s, which can be reduced to 3 seconds by using the llvm backend.
Here, the range-check by the array library really hurts. Using the unchecked array access (as one should, after checking for bugs, since the checks are already done in the code controlling the loops), GHC's native backend finishes in 2.4s, going via the llvm backend lets the computation finish in 1.55s, which is decent, although significantly slower than both C and Java. Using the primitives from GHC.Prim
instead of the array library, the llvm backend produces code that runs in 1.16s (again, without bounds-checking on each access, but that only valid indices are produced during the computation can in this case easily be proved before, so here, no memory-safety is sacrificed¹; checking each access brings the time up to 1.96s, still significantly better than the bounds checking of the array library).
Bottom line: GHC needs (much) faster branching for the bounds-checking, and there's room for improvement in the optimiser, but in principle, "Haskell's approach (purity encoded in the type system) is compatible with efficiency, memory-safety and simplicity", we're just not yet there. For the time being, one has to decide how much of which point one is willing to sacrifice.
¹ Yes, that's a special case, in general omitting the bounds-check does sacrifice memory-safety, or it is at least harder to prove that it doesn't.
:: IORef Arg -> ... -> IORef This -> IO Ret
. – leftaroundabout