Well, how would you speed up the ruby implementation? Even though while they're different languages similar optimizations can be applied, namely memoization, and smarter algorithms.
1. Memoization
Memoization prevents you from calculating the factorial over and over.
Here's your version of pairs:
pairs :: (Integer, Integer) -> [(Integer, Integer)]
pairs (lower, upper) = [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
where factorial n = product [1..n]
How often does factorial get called? Well, we can say that it gets called at least upper - lower times, although it could be that we don't remember the values from previous calls. In this case, we need (upper - lower)² calls of factorial. Even though a factorial is fairly simple to compute, it doesn't come for free.
What if we instead generate a infinite list of factorials and simply pick the right ones?
pairsMem :: (Integer, Integer) -> [(Integer, Integer)]
pairsMem (lower, upper) = [(m, n) | m <- [lower..upper], n <- [lower..upper], 1 + factorial n == m*m]
where factorial = (factorials!!) . fromInteger
factorials = scanl (*) 1 [1..]
Now factorials is the list [1,1,2,6,24,…], and factorial simply looks up the corresponding value. How do both versions compare?
Your version
main = print $ pairs (0,1000)
> ghc --make SO.hs -O2 -rtsopts > /dev/null
> ./SO.hs +RTS -s
[(5,4),(11,5),(71,7)]
204,022,149,768 bytes allocated in the heap
220,119,948 bytes copied during GC
41,860 bytes maximum residency (2 sample(s))
20,308 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 414079 colls, 0 par 2.39s 2.23s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 67.33s ( 67.70s elapsed)
GC time 2.39s ( 2.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 69.72s ( 69.93s elapsed)
%GC time 3.4% (3.2% elapsed)
Alloc rate 3,030,266,322 bytes per MUT second
Productivity 96.6% of total user, 96.3% of total elapsed
Around 68 seconds.
pairsMem
main = print $ pairsMem (0,1000)
> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO.hs +RTS -s
[(5,4),(11,5),(71,7)]
551,558,988 bytes allocated in the heap
644,420 bytes copied during GC
231,120 bytes maximum residency (2 sample(s))
71,504 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1159 colls, 0 par 0.00s 0.01s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.17s ( 2.18s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.17s ( 2.18s elapsed)
%GC time 0.0% (0.3% elapsed)
Alloc rate 253,955,217 bytes per MUT second
Productivity 100.0% of total user, 99.5% of total elapsed
Around two seconds or only 3% of the original time. Not bad for an almost trivial change. However, as you can see, we use twice the memory. After all, we're saving the factorials in a list. However, the total amount of allocated bytes is 0.27% of the non-memoized variant, since we don't need to regenerate the product.
pairsMem (100,10000)
What about large numbers? You said that with (100,1000) you stopped after 45 minutes. How fast is the memoized version?
main = print $ pairsMem (100,10000)
> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO.hs +RTS -s
… 20 minutes later Ctrl+C…
That still takes too long. What else can we do?
2. Smarter pairs
Lets go back to the drawing board. You're checking all pairs (n,m) in (lower,upper). Is this reasonable?
Actually, no, since factorials grow tremendously fast. So for any natural number let f(m) be the greatest natural number such that f(m)! <= m. Now, for any m, we only need to check the f(m) first factorials - all other's will be greater.
Just for the record, f(10^100) is 70.
Now the strategy is clear: we generate as many factorials as we need and simply check if m * m - 1 is in the list of factorials:
import Data.Maybe (isJust)
import Data.List (elemIndex)
pairsList :: (Integer, Integer) -> [(Integer, Integer)]
pairsList (lower, upper) = [(m, fromIntegral ret)
| m <- [lower..upper],
let l = elemIndex (m*m - 1) fs,
isJust l,
let Just ret = l
]
where fs = takeWhile (<upper*upper) $ scanl (*) 1 [1..]
How well does this version fare against pairsMemLim?
main = print $ pairsList (1, 10^8)
> ghc --make -O2 -rtsopts SO.hs > /dev/null
> ./SO +RTS -s
[(5,4),(11,5),(71,7)]
21,193,518,276 bytes allocated in the heap
2,372,136 bytes copied during GC
58,672 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 40823 colls, 0 par 0.06s 0.11s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 38.17s ( 38.15s elapsed)
GC time 0.06s ( 0.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 38.23s ( 38.26s elapsed)
%GC time 0.2% (0.3% elapsed)
Alloc rate 555,212,922 bytes per MUT second
Productivity 99.8% of total user, 99.8% of total elapsed
Allright, down to 40s. But what if we use a data structure which provides an even more efficient lookup?
3. Using the correct data structure
Since we want efficient lookup, we're going to use Set. The function almost stays the same, however, fs is going to be Set Integer, and the lookup is done via lookupIndex:
import Data.Maybe (isJust)
import qualified Data.Set as S
pairsSet :: (Integer, Integer) -> [(Integer, Integer)]
pairsSet (lower, upper) = [(m, 1 + fromIntegral ret)
| m <- [lower..upper],
let l = S.lookupIndex (m*m - 1) fs,
isJust l,
let Just ret = l
]
where fs = S.fromList . takeWhile (<upper*upper) $ scanl (*) 1 [1..]
And here the performance of pairsSet:
> ./SO +RTS -s
[(5,4),(11,5),(71,7)]
18,393,520,096 bytes allocated in the heap
2,069,872 bytes copied during GC
58,752 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 35630 colls, 0 par 0.06s 0.08s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 18.52s ( 18.52s elapsed)
GC time 0.06s ( 0.08s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.58s ( 18.60s elapsed)
%GC time 0.3% (0.4% elapsed)
Alloc rate 993,405,304 bytes per MUT second
Productivity 99.7% of total user, 99.5% of total elapsed
This concludes our journey through optimization. By the way, we have reduced the complexity from 𝒪(n³) to 𝒪(n log n) since our data structure gives us a logarithmic search.