improve function performance

Question

I am putting together a small program that checks for solutions for Brocard's Problem or so called Brown Numbers and I first created a draft in ruby:

class Integer
  def factorial
    f = 1; for i in 1..self; f *= i; end; f
  end
end

boundary = 1000
m = 0

# Brown Numbers - pair of integers (m,n) where n factorial is equal with square root of m

while m <= boundary

    n = 0

    while n <= boundary
        puts "(#{m},#{n})" if ((n.factorial + 1) == (m ** 2)) 
        n += 1
    end

    m += 1
end

But I discovered that Haskell is much more appropriate for doing mathematical operations, therefore I have asked a question previously and I got an answer pretty quick on how to translate my ruby code to Haskell:

results :: [(Integer, Integer)] --Use instead of `Int` to fix overflow issue
results =  [(x,y) | x <- [1..1000], y <- [1..1000] , 1 + fac x == y*y]
    where fac n = product [1..n]

I changed that slightly so I can run the same operation from and up to whatever number I want, because the above will do it from 1 up to 1000 or any hardcoded number but I would like to be able to decide the interval it should go through, ergo:

pairs :: (Integer, Integer) -> [(Integer, Integer)]
pairs (lower, upper) =  [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m] where factorial n = product [1..n]

If possible, I would like some examples or pointers on optimisations for improving the speed of the operations, because at this point if I run this operation for an interval such as [100..10000] it takes a long time (I stopped it after 45mins).

PS The performance optimisations are to be applied to the Haskell implementation of the calculations (pairs function), not the ruby one, in case some may wonder which function I am talking about.

possible duplicate of run haskell operations in parallel or multithreaded — Boyd Stephen Smith Jr.
@rolandjitsu what do you mean "accept that other answer"? If you find that the answer you accepted on that post has been superseded by another, then unaccept the one you have and choose the better one. On SO we want to have one answer for each question accepted, otherwise the site would have given users the ability to accept multiple answers for a single question. — bheklilr
This is an unusual situation. You previously posted a question, got a good, relevant answer, which you accepted. Subsequently, someone else gave a good answer to a question you didn't ask, but you found that answer helpful. You now want to "Jeopardy"-ize that answer by posting the associated question. I don't think that's justified if the only reason is to reward the answerer, but if it's a good question, why not? If you don't do so, it's unlikely anyone will find that answer, and who knows, you may get a better answer. — Cary Swoveland

Zeta Zeta · Accepted Answer · 2014-07-24T18:47:40

Well, how would you speed up the ruby implementation? Even though while they're different languages similar optimizations can be applied, namely memoization, and smarter algorithms.

1. Memoization

Memoization prevents you from calculating the factorial over and over.

Here's your version of pairs:

pairs :: (Integer, Integer) -> [(Integer, Integer)]
pairs (lower, upper) =  [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
    where factorial n = product [1..n]

How often does factorial get called? Well, we can say that it gets called at least upper - lower times, although it could be that we don't remember the values from previous calls. In this case, we need (upper - lower)² calls of factorial. Even though a factorial is fairly simple to compute, it doesn't come for free.

What if we instead generate a infinite list of factorials and simply pick the right ones?

pairsMem :: (Integer, Integer) -> [(Integer, Integer)]
pairsMem (lower, upper) =  [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
    where factorial  = (factorials!!) . fromInteger
          factorials = scanl (*) 1 [1..]

Now factorials is the list [1,1,2,6,24,…], and factorial simply looks up the corresponding value. How do both versions compare?

Your version

main = print $ pairs (0,1000)

> ghc --make SO.hs -O2 -rtsopts > /dev/null
> ./SO.hs +RTS -s
[(5,4),(11,5),(71,7)]
 204,022,149,768 bytes allocated in the heap
     220,119,948 bytes copied during GC
          41,860 bytes maximum residency (2 sample(s))
          20,308 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     414079 colls,     0 par    2.39s    2.23s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   67.33s  ( 67.70s elapsed)
  GC      time    2.39s  (  2.23s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   69.72s  ( 69.93s elapsed)

  %GC     time       3.4%  (3.2% elapsed)

  Alloc rate    3,030,266,322 bytes per MUT second

  Productivity  96.6% of total user, 96.3% of total elapsed

Around 68 seconds.

`pairsMem`

main = print $ pairsMem (0,1000)

> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO.hs +RTS -s
[(5,4),(11,5),(71,7)]
     551,558,988 bytes allocated in the heap
         644,420 bytes copied during GC
         231,120 bytes maximum residency (2 sample(s))
          71,504 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1159 colls,     0 par    0.00s    0.01s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.17s  (  2.18s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    2.17s  (  2.18s elapsed)

  %GC     time       0.0%  (0.3% elapsed)

  Alloc rate    253,955,217 bytes per MUT second

  Productivity 100.0% of total user, 99.5% of total elapsed

Around two seconds or only 3% of the original time. Not bad for an almost trivial change. However, as you can see, we use twice the memory. After all, we're saving the factorials in a list. However, the total amount of allocated bytes is 0.27% of the non-memoized variant, since we don't need to regenerate the product.

`pairsMem (100,10000)`

What about large numbers? You said that with (100,1000) you stopped after 45 minutes. How fast is the memoized version?

main = print $ pairsMem (100,10000)

> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO.hs +RTS -s
… 20 minutes later Ctrl+C…

That still takes too long. What else can we do?

2. Smarter pairs

Lets go back to the drawing board. You're checking all pairs (n,m) in (lower,upper). Is this reasonable?

Actually, no, since factorials grow tremendously fast. So for any natural number let f(m) be the greatest natural number such that f(m)! <= m. Now, for any m, we only need to check the f(m) first factorials - all other's will be greater.

Just for the record, f(10^100) is 70.

Now the strategy is clear: we generate as many factorials as we need and simply check if m * m - 1 is in the list of factorials:

import Data.Maybe (isJust)
import Data.List (elemIndex)

pairsList :: (Integer, Integer) -> [(Integer, Integer)]
pairsList (lower, upper) = [(m, fromIntegral ret) 
                           | m <- [lower..upper], 
                             let l = elemIndex (m*m - 1) fs,
                             isJust l,
                             let Just ret = l
                           ]
    where fs = takeWhile (<upper*upper) $ scanl (*) 1 [1..]

How well does this version fare against pairsMemLim?

main = print $ pairsList (1, 10^8)

> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO +RTS -s
[(5,4),(11,5),(71,7)]
  21,193,518,276 bytes allocated in the heap
       2,372,136 bytes copied during GC
          58,672 bytes maximum residency (2 sample(s))
          19,580 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     40823 colls,     0 par    0.06s    0.11s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   38.17s  ( 38.15s elapsed)
  GC      time    0.06s  (  0.11s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   38.23s  ( 38.26s elapsed)

  %GC     time       0.2%  (0.3% elapsed)

  Alloc rate    555,212,922 bytes per MUT second

  Productivity  99.8% of total user, 99.8% of total elapsed

Allright, down to 40s. But what if we use a data structure which provides an even more efficient lookup?

3. Using the correct data structure

Since we want efficient lookup, we're going to use Set. The function almost stays the same, however, fs is going to be Set Integer, and the lookup is done via lookupIndex:

import Data.Maybe (isJust)
import qualified Data.Set as S

pairsSet :: (Integer, Integer) -> [(Integer, Integer)]
pairsSet (lower, upper) = [(m, 1 + fromIntegral ret) 
                          | m <- [lower..upper], 
                            let l = S.lookupIndex (m*m - 1) fs,
                            isJust l,
                            let Just ret = l
                          ]
    where fs = S.fromList . takeWhile (<upper*upper) $ scanl (*) 1 [1..]

And here the performance of pairsSet:

 > ./SO +RTS -s
[(5,4),(11,5),(71,7)]
  18,393,520,096 bytes allocated in the heap
       2,069,872 bytes copied during GC
          58,752 bytes maximum residency (2 sample(s))
          19,580 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     35630 colls,     0 par    0.06s    0.08s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   18.52s  ( 18.52s elapsed)
  GC      time    0.06s  (  0.08s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   18.58s  ( 18.60s elapsed)

  %GC     time       0.3%  (0.4% elapsed)

  Alloc rate    993,405,304 bytes per MUT second

  Productivity  99.7% of total user, 99.5% of total elapsed

This concludes our journey through optimization. By the way, we have reduced the complexity from 𝒪(n³) to 𝒪(n log n) since our data structure gives us a logarithmic search.