I am writing a longest common subsequence algorithm in Haskell using vector library and state monad (to encapsulate the very imperative and mutable nature of Miller O(NP) algorithm). I have already written it in C for some project I needed it for, and am now writing it in Haskell as a way to explore how to write those imperative grid-walk algorithms with good performance that match C. The version I wrote with unboxed vectors is about 4-times slower than C version for same inputs (and compiled with right optimization flags - I used both system clock time and Criterion
methods to validate the relative time measurements between Haskell and C versions, and same data types, both large and tiny inputs). I have been trying to figure out where the performance issues might be, and will appreciate feedback - it is possible there is some well-known performance issue I might have run into here, especially in vector library which I am heavily using here.
In my code, I have one function called gridWalk that is called most often, and also, does most of the work. The performance slowdown is likely to be there, but I can't figure out what it could be. Complete Haskell code is here. Snippets of the code below:
import Data.Vector.Unboxed.Mutable as MU
import Data.Vector.Unboxed as U hiding (mapM_)
import Control.Monad.ST as ST
import Control.Monad.Primitive (PrimState)
import Control.Monad (when)
import Data.STRef (newSTRef, modifySTRef, readSTRef)
import Data.Int
type MVI1 s = MVector (PrimState (ST s)) Int
cmp :: U.Vector Int32 -> U.Vector Int32 -> Int -> Int -> Int
cmp a b i j = go 0 i j
where
n = U.length a
m = U.length b
go !len !i !j| (i<n) && (j<m) && ((unsafeIndex a i) == (unsafeIndex b j)) = go (len+1) (i+1) (j+1)
| otherwise = len
-- function to find previous y on diagonal k for furthest point
findYP :: MVI1 s -> Int -> Int -> ST s (Int,Int)
findYP fp k offset = do
let k0 = k+offset-1
k1 = k+offset+1
y0 <- MU.unsafeRead fp k0 >>= \x -> return $ 1+x
y1 <- MU.unsafeRead fp k1
if y0 > y1 then return (k0,y0)
else return (k1,y1)
{-#INLINE findYP #-}
gridWalk :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> ST s ()
gridWalk a b fp !k cmp = {-#SCC gridWalk #-} do
let !offset = 1+U.length a
(!kp,!yp) <- {-#SCC findYP #-} findYP fp k offset
let xp = yp-k
len = {-#SCC cmp #-} cmp a b xp yp
x = xp+len
y = yp+len
{-#SCC "updateFP" #-} MU.unsafeWrite fp (k+offset) y
return ()
{-#INLINE gridWalk #-}
-- The function below executes ct times, and updates furthest point as they are found during furthest point search
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op = {-#SCC findSnakes #-} U.forM_ (U.fromList [0..ct-1]) (\x -> gridWalk a b fp (op k x) cmp)
{-#INLINE findSnakes #-}
I added some cost center annotations, and ran profiling with a certain LCS input for testing. Here is what I get:
total time = 2.39 secs (2394 ticks @ 1000 us, 1 processor)
total alloc = 4,612,756,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
gridWalk Main 67.5 52.7
findSnakes Main 23.2 27.8
cmp Main 4.2 0.0
findYP Main 3.5 19.4
updateFP Main 1.6 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 64 0 0.0 0.0 100.0 100.0
main Main 129 0 0.0 0.0 0.0 0.0
CAF Main 127 0 0.0 0.0 100.0 100.0
findSnakes Main 141 0 0.0 0.0 0.0 0.0
main Main 128 1 0.0 0.0 100.0 100.0
findSnakes Main 138 0 0.0 0.0 0.0 0.0
gridWalk Main 139 0 0.0 0.0 0.0 0.0
cmp Main 140 0 0.0 0.0 0.0 0.0
while Main 132 4001 0.1 0.0 100.0 100.0
findSnakes Main 133 12000 23.2 27.8 99.9 99.9
gridWalk Main 134 16004000 67.5 52.7 76.7 72.2
cmp Main 137 16004000 4.2 0.0 4.2 0.0
updateFP Main 136 16004000 1.6 0.0 1.6 0.0
findYP Main 135 16004000 3.5 19.4 3.5 19.4
newVI1 Main 130 1 0.0 0.0 0.0 0.0
newVI1.\ Main 131 8004 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 112 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 104 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 102 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 95 0 0.0 0.0 0.0 0.0
If I am interpreting the profiling output correctly (and assuming there are not too much distortions due to profiling), gridWalk
takes most of the time, but the main functions cmp
and findYP
that do the heavy lifting in gridWalk
, seem to take very little time in profiling report. So, perhaps the bottleneck is in forM_
wrapper that findSnakes
function uses to call gridWalk
? The heap profile too seems very clean:
Reading the core, nothing really jumps out. I thought some values in inner loops might be boxed, but I don't spot them in the core. I hope the performance issue is due to something simple I have missed.
Update
Following suggestion of @DanielFischer, I replaced forM_
of Data.Vector.Unboxed
with that of Control.Monad
in findSnakes
function which improved the performance from 4x to 2.5x of C version. Haskell and C versions are now posted here if you want to try them out.
I am still digging through the core to see where the bottlenecks are. gridWalk
is most frequently called function, and for it to perform well, lcsh
should reduce whileM_
loop to a nice iterative inner loop of condition check and inlined findSnakes
code. I suspect that in assembly, this is not the case for whileM_
loop, but since I am not very knowledgeable about translating core, and locating name-mangled GHC functions in assembly, I guess it is just a matter of patiently plugging away at the problem until I figure it out. Meanwhile if there are any pointers on performance fixes, they will be appreciated.
Another possibility that I can think of is overhead of heap checks during function calls. As seen in profiling report, gridWalk
is called 16004000 times. Assuming 6 cycles for heap check (I am guessing it is less, but still let us assume that), it is ~0.02 seconds on a 3.33GHz box for 96024000 cycles.
Also, some performance numbers:
Haskell code (GHC 7.6.1 x86_64)
: It was ~0.25s before forM_
fix.
time ./T
1
real 0m0.150s
user 0m0.145s
sys 0m0.003s
C code (gcc 4.7.2 x86_64)
:
time ./test
1
real 0m0.065s
user 0m0.063s
sys 0m0.000s
Update 2:
Updated code is here. Using STUArray
doesn't change the numbers either. The performance is about 1.5x on Mac OS X (x86_64,ghc7.6.1)
, pretty similar to what @DanielFischer reported on Linux.
Haskell code:
$ time ./Diff
1
real 0m0.087s
user 0m0.084s
sys 0m0.003s
C code:
$ time ./test
1
real 0m0.056s
user 0m0.053s
sys 0m0.002s
Glancing at the cmm
, the call is tail-recursive, and is turned into a loop by llvm
. But each new iteration seems to allocate new values which invokes heap check too, and so, might explain the difference in performance. I have to think about how to write the tail recursion in such a way such that no values are allocated across iterations, avoiding heap-check and allocation overhead.