12 votes

I am writing a longest common subsequence algorithm in Haskell using the vector library and the state monad (to encapsulate the very imperative, mutable nature of Miller's O(NP) algorithm). I have already written it in C for a project I needed it for, and am now writing it in Haskell to explore how to write imperative grid-walk algorithms with performance that matches C. The unboxed-vector version is about 4x slower than the C version on the same inputs (both compiled with the right optimization flags; I used both wall-clock time and Criterion to validate the relative measurements, with the same data types and both large and tiny inputs). I have been trying to figure out where the performance issue might be and would appreciate feedback - it is possible I have run into some well-known performance problem, especially in the vector library, which I use heavily here.

In my code, one function, gridWalk, is called most often and also does most of the work, so the slowdown is likely there, but I can't figure out what it could be. The complete Haskell code is here. Snippets of the code are below:

import Data.Vector.Unboxed.Mutable as MU
import Data.Vector.Unboxed as U hiding (mapM_)
import Control.Monad.ST as ST
import Control.Monad.Primitive (PrimState)
import Control.Monad (when) 
import Data.STRef (newSTRef, modifySTRef, readSTRef)
import Data.Int


type MVI1 s  = MVector (PrimState (ST s)) Int

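-- length of the matching run (snake) starting at index i of a and index j of b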
cmp :: U.Vector Int32 -> U.Vector Int32 -> Int -> Int -> Int
cmp a b i j = go 0 i j
               where
                 n = U.length a
                 m = U.length b
                 go !len !i !j
                   | (i<n) && (j<m) && ((unsafeIndex a i) == (unsafeIndex b j)) = go (len+1) (i+1) (j+1)
                   | otherwise = len

-- function to find previous y on diagonal k for furthest point 
findYP :: MVI1 s -> Int -> Int -> ST s (Int,Int)
findYP fp k offset = do
              let k0 = k+offset-1
                  k1 = k+offset+1
              y0 <- MU.unsafeRead fp k0 >>= \x -> return $ 1+x
              y1 <- MU.unsafeRead fp k1
              if y0 > y1 then return (k0,y0)
              else return (k1,y1)
{-#INLINE findYP #-}

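-- one grid-walk step: find the furthest point on diagonal k, slide along the snake via cmp, and write the new furthest y into fp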
gridWalk :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> ST s ()
gridWalk a b fp !k cmp = {-#SCC gridWalk #-} do
   let !offset = 1+U.length a
   (!kp,!yp) <- {-#SCC findYP #-} findYP fp k offset                          
   let xp = yp-k
       len = {-#SCC cmp #-} cmp a b xp yp
       x = xp+len
       y = yp+len

   {-#SCC "updateFP" #-} MU.unsafeWrite fp (k+offset) y  
   return ()
{-#INLINE gridWalk #-}

-- The function below executes ct times, and updates furthest point as they are found during furthest point search
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s ->  Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op = {-#SCC findSnakes #-} U.forM_ (U.fromList [0..ct-1]) (\x -> gridWalk a b fp (op k x) cmp)
{-#INLINE findSnakes #-}

I added some cost center annotations, and ran profiling with a certain LCS input for testing. Here is what I get:

  total time  =        2.39 secs   (2394 ticks @ 1000 us, 1 processor)
  total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc

gridWalk    Main       67.5   52.7
findSnakes  Main       23.2   27.8
cmp         Main        4.2    0.0
findYP      Main        3.5   19.4
updateFP    Main        1.6    0.0


                                                         individual     inherited
COST CENTRE    MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN           MAIN                     64           0    0.0    0.0   100.0  100.0
 main          Main                    129           0    0.0    0.0     0.0    0.0
 CAF           Main                    127           0    0.0    0.0   100.0  100.0
  findSnakes   Main                    141           0    0.0    0.0     0.0    0.0
  main         Main                    128           1    0.0    0.0   100.0  100.0
   findSnakes  Main                    138           0    0.0    0.0     0.0    0.0
    gridWalk   Main                    139           0    0.0    0.0     0.0    0.0
     cmp       Main                    140           0    0.0    0.0     0.0    0.0
   while       Main                    132        4001    0.1    0.0   100.0  100.0
    findSnakes Main                    133       12000   23.2   27.8    99.9   99.9
     gridWalk  Main                    134    16004000   67.5   52.7    76.7   72.2
      cmp      Main                    137    16004000    4.2    0.0     4.2    0.0
      updateFP Main                    136    16004000    1.6    0.0     1.6    0.0
      findYP   Main                    135    16004000    3.5   19.4     3.5   19.4
   newVI1      Main                    130           1    0.0    0.0     0.0    0.0
   newVI1.\   Main                    131        8004    0.0    0.0     0.0    0.0
 CAF           GHC.Conc.Signal         112           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding         104           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding.Iconv   102           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Handle.FD         95           0    0.0    0.0     0.0    0.0

If I am interpreting the profiling output correctly (and assuming profiling doesn't distort things too much), gridWalk takes most of the time, yet cmp and findYP, the functions that do the heavy lifting inside gridWalk, seem to take very little time in the report. So perhaps the bottleneck is in the forM_ wrapper that findSnakes uses to call gridWalk? The heap profile also looks very clean (heap profile image omitted).

Reading the core, nothing really jumps out. I thought some values in inner loops might be boxed, but I don't spot them in the core. I hope the performance issue is due to something simple I have missed.

Update

Following suggestion of @DanielFischer, I replaced forM_ of Data.Vector.Unboxed with that of Control.Monad in findSnakes function which improved the performance from 4x to 2.5x of C version. Haskell and C versions are now posted here if you want to try them out.
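For reference, the changed findSnakes is essentially the sketch below (an assumption on my part about the exact form: forM_ is imported qualified from Control.Monad to avoid clashing with the unqualified Data.Vector.Unboxed import, everything else unchanged):

import qualified Control.Monad as M

findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op =
    -- a plain list range is eliminated by GHC; no intermediate Vector is allocated
    M.forM_ [0..ct-1] (\x -> gridWalk a b fp (op k x) cmp)
{-# INLINE findSnakes #-}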

I am still digging through the core to see where the bottlenecks are. gridWalk is the most frequently called function, and for it to perform well, lcsh should reduce the whileM_ loop to a tight iterative inner loop of a condition check plus the inlined findSnakes code. I suspect that in the assembly this is not the case for the whileM_ loop, but since I am not very experienced at reading core or locating name-mangled GHC functions in assembly, I guess it is a matter of patiently plugging away at it. Meanwhile, any pointers on performance fixes will be appreciated.

Another possibility I can think of is the overhead of heap checks during function calls. As the profiling report shows, gridWalk is called 16,004,000 times. Assuming 6 cycles per heap check (I am guessing it is less, but let us assume that), that is 96,024,000 cycles, or roughly 0.03 seconds on a 3.33 GHz box.

Also, some performance numbers:

Haskell code (GHC 7.6.1, x86_64); it was ~0.25 s before the forM_ fix:

 time ./T
1

real    0m0.150s
user    0m0.145s
sys     0m0.003s

C code (gcc 4.7.2 x86_64):

time ./test
1

real    0m0.065s
user    0m0.063s
sys     0m0.000s

Update 2:

The updated code is here. Using STUArray doesn't change the numbers either. The performance is about 1.5x of C on Mac OS X (x86_64, GHC 7.6.1), pretty similar to what @DanielFischer reported on Linux.

Haskell code:

$ time ./Diff
1

real    0m0.087s
user    0m0.084s
sys     0m0.003s

C code:

$ time ./test
1

real    0m0.056s
user    0m0.053s
sys 0m0.002s

Glancing at the Cmm, the call is tail-recursive and is turned into a loop by LLVM. But each iteration seems to allocate new values, which also triggers a heap check, and that might explain the remaining performance difference. I have to think about how to write the tail recursion so that no values are allocated across iterations, avoiding the heap-check and allocation overhead; a generic sketch of the shape I am after is below.
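The pattern I have in mind is the usual strict-accumulator worker loop (a hypothetical sumTo example, not the actual lcsh code), where all per-iteration state lives in strict Ints that GHC can keep in registers:

-- hypothetical example: with -O2 the worker go should compile to a loop over
-- unboxed Int#s, so nothing is allocated per iteration and no heap check runs
sumTo :: Int -> Int
sumTo n = go 0 0
  where
    go !acc !i
      | i >= n    = acc
      | otherwise = go (acc + i) (i + 1)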

New profiling data? I know you've made some changes and I'm wondering if they affect anything. Optimizing findSnakes is not going to get you to the 1.2x number you are looking for, but based on this profile, I'd look there, since the time spent in the body of a single findSnakes call is about 400-500 times the time spent in the body of a gridWalk call, so it seems like low-hanging fruit. - Boyd Stephen Smith Jr.
@BoydStephenSmithJr., I will post new code this evening. The numbers are still 1.45x. It seems GHC/LLVM won't turn the tail recursion in this case into a nice iterative loop like C does - my guess is that heap-check and function-call overhead account for the 1.45x. It might require a cleverer technique, like a stream representation that can be turned into an iterative loop. I am going to post the code so that anyone interested can take a crack at it. I am still trying to figure out how to express the data structure as an F-algebra/coalgebra that GHC can optimize into loops. - Sal

1 Answer

10 votes

You take a huge hit at

U.forM_ (U.fromList [0..ct-1])

in findSnakes. I'm convinced that isn't supposed to happen (ticket?), but that allocates a new Vector to traverse every time findSnakes is called. If you use

Control.Monad.forM_ [0 .. ct-1]

instead, the running time roughly halves, and the allocation drops by a factor of about 500 here. (GHC optimises C.M.forM_ [0 :: Int .. limit] well, the list is eliminated, and what remains is basically a loop.) You can do slightly better by writing the loop yourself.

Some things that cause gratuitous allocation/code size bloat without hurting performance much are

  • the unused Bool argument of lcsh
  • the cmp argument to findSnakes and gridWalk; if these are never called with a different comparison than the top-level cmp, that argument leads to unnecessary code duplication.
  • the general type of while; specialising it to the used type ST s Bool -> ST s () -> ST s () reduces allocation (much), and also running time (slightly, but noticeably, here) - a sketch follows this list.
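Concretely, the specialised combinator could look like this (a sketch; it assumes the original while is the usual polymorphic Monad m => m Bool -> m () -> m () loop):

-- monomorphic loop combinator, specialised to ST as suggested above;
-- uses `when` from Control.Monad, which the question's module already imports
while :: ST s Bool -> ST s () -> ST s ()
while cond body = go
  where
    go = do
      continue <- cond
      when continue (body >> go)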

A general word about profiling: Compiling a programme for profiling inhibits many optimisations. In particular for libraries like vector, bytestring or text that make heavy use of fusion, profiling often produces misleading results.

For example, your original code produces here

    total time  =        3.42 secs   (3415 ticks @ 1000 us, 1 processor)
    total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.7   52.7   2176 2432608000
findSnakes  Main       20.0   27.8    682 1281440080
cmp         Main        9.2    0.0    313        16
findYP      Main        4.2   19.4    144 896224000
updateFP    Main        2.7    0.0     91         0

Just adding a bang on the binding of len in gridWalk changes nothing at all in the non-profiling version, but for the profiling version

    total time  =        2.98 secs   (2985 ticks @ 1000 us, 1 processor)
    total alloc = 3,204,404,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.0   32.0   1881 1024256000
findSnakes  Main       22.2   40.0    663 1281440080
cmp         Main        7.2    0.0    214        16
findYP      Main        4.7   28.0    140 896224000
updateFP    Main        2.7    0.0     82         0

it makes a lot of difference. For the version including the changes mentioned above (and the bang on len in gridWalk), the profiling version says

total alloc = 1,923,412,776 bytes  (excludes profiling overheads)

but the non-profiling version

     1,814,424 bytes allocated in the heap
        10,808 bytes copied during GC
        49,064 bytes maximum residency (2 sample(s))
        25,912 bytes maximum slop
             1 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
Gen  0         2 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.12s  (  0.12s elapsed)
GC      time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.12s  (  0.12s elapsed)

says it allocated 1000-fold less than the profiling version.

For code using vector and friends, studying the generated core (or the assembly, if you are proficient in reading that) is more reliable for identifying bottlenecks than profiling - unfortunately, it is also much more time-consuming and difficult.


Concerning the update, the C runs a little slower on my box (gcc-4.7.2, -O3)

$ time ./miltest1

real    0m0.074s
user    0m0.073s
sys     0m0.001s

but the Haskell runs in about the same time:

$ time ./hsmiller
1

real    0m0.151s
user    0m0.149s
sys     0m0.001s

That is a little faster when compiling via the LLVM backend:

$ time ./hsmiller1

real    0m0.131s
user    0m0.129s
sys     0m0.001s

And when we replace the forM_ with a manual loop,

findSnakes a b fp !k !ct op = go 0
  where
    go x
        | x < ct    = gridWalk a b fp (op k x) >> go (x+1)
        | otherwise = return ()

it gets a bit faster,

$ time ./hsmiller
1

real    0m0.124s
user    0m0.121s
sys     0m0.002s

resp. via LLVM:

$ time ./hsmiller
1

real    0m0.108s
user    0m0.107s
sys     0m0.000s

By and large, the generated core looks fine; one small annoyance was

Main.$wa
  :: forall s.
     GHC.Prim.Int#
     -> GHC.Types.Int
     -> GHC.Prim.State# s
     -> (# GHC.Prim.State# s, Main.MVI1 s #)

and a slightly roundabout implementation. That is fixed by making newVI1 strict in its second argument,

newVI1 n !x = do

Since that isn't called often, the effect on performance is of course negligible.

The meat is the core for lcsh, and that doesn't look too bad. The only boxed things in it are the Ints read from/written to the STRef, and that is inevitable. What's not so pleasant is that the core contains a lot of code duplication, but in my experience that rarely is a real performance problem, and not all duplicated code survives the code generation.

and for it to perform well, lcsh should reduce whileM_ loop to a nice iterative inner loop of condition check and inlined findSnakes code.

You get an inner loop when you add an INLINE pragma to whileM_, but that loop is not nice, and in this case it is much slower than keeping whileM_ out-of-line (I'm not sure whether that's solely due to code size, but it could be).