Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell

702

votes

I have taken Problem #12 from Project Euler as a programming exercise and to compare my (surely not optimal) implementations in C, Python, Erlang and Haskell. In order to get some higher execution times, I search for the first triangle number with more than 1000 divisors instead of 500 as stated in the original problem.

The result is the following:

lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320

real    0m11.074s
user    0m11.070s
sys 0m0.000s

Python:

lorenzo@enzo:~/erlang$ time ./euler12.py 
842161320

real    1m16.632s
user    1m16.370s
sys 0m0.250s

Python with PyPy:

lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py 
842161320

real    0m13.082s
user    0m13.050s
sys 0m0.020s

Erlang:

lorenzo@enzo:~/erlang$ erlc euler12.erl 
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> 842161320

real    0m48.259s
user    0m48.070s
sys 0m0.020s

Haskell:

lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx 
842161320

real    2m37.326s
user    2m37.240s
sys 0m0.080s

Summary:

C: 100%
Python: 692% (118% with PyPy)
Erlang: 436% (135% thanks to RichardC)
Haskell: 1421%

I suppose that C has a big advantage as it uses long for the calculations and not arbitrary length integers as the other three. Also it doesn't need to load a runtime first (Do the others?).

Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

EDIT:

Question 4: Do my functional implementations permit LCO (last call optimization, a.k.a tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?

I really tried to implement the same algorithm as similar as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.

Source codes used:

#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square;
    int count = isquare == square ? -1 : 0;
    long candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}

#! /usr/bin/env python3.2

import math

def factorCount (n):
    square = math.sqrt (n)
    isquare = int (square)
    count = -1 if isquare == square else 0
    for candidate in range (1, isquare + 1):
        if not n % candidate: count += 2
    return count

triangle = 1
index = 1
while factorCount (triangle) < 1001:
    index += 1
    triangle += index

print (triangle)

-module (euler12).
-compile (export_all).

factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).

factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

factorCount (Number, Sqrt, Candidate, Count) ->
    case Number rem Candidate of
        0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
        _ -> factorCount (Number, Sqrt, Candidate + 1, Count)
    end.

nextTriangle (Index, Triangle) ->
    Count = factorCount (Triangle),
    if
        Count > 1000 -> Triangle;
        true -> nextTriangle (Index + 1, Triangle + Index + 1)  
    end.

solve () ->
    io:format ("~p~n", [nextTriangle (1, 1) ] ),
    halt (0).

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' number sqrt candidate count
    | fromIntegral candidate > sqrt = count
    | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
    | otherwise = factorCount' number sqrt (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

pythonc performancehaskellerlang

@Jochen (and Seth) Not really that C is fast or awesome, but it is perceived as easy to write performant code (that might not be true, but most programs seem to be able, so true enough). As I explore in my answer, and have found to be true over time, the programmer skill and knowledge of common optimizations for the chosen language is of great importance (especially so for Haskell). - Thomas M. DuBuisson

Just checked with Mathematica -- it takes 0.25sec (with C it takes 6sec here), and the code is just: Euler12[x_Integer] := Module[{s = 1}, For[i = 2, DivisorSigma[0, s] < x, i++, s += i]; s]. hurray! - tsvikas

Is there anyone else out there that remembers these wars between C and assembly? "Sure! You can write your code 10x faster in C, but can your C code run this quick?..." I'm sure the same battles were fought between machine-code and assembly. - JS.

@JS: Probably not, as assembly is simply a set of mnemonics that you type instead of the raw binary machine code - normally there is a 1-1 correspondence between them. - Callum Rogers

the conclusion, for Haskell: -O2 gives it about 3x a speedup, and using Int instead of Integer about 4x-6x for the total speedup of 12x-14x and more. - Will Ness

18 Answers

825

votes

Using GHC 7.0.3, gcc 4.4.6, Linux 2.6.29 on an x86_64 Core2 Duo (2.5GHz) machine, compiling using ghc -O2 -fllvm -fforce-recomp for Haskell and gcc -O3 -lm for C.

Your C routine runs in 8.4 seconds (faster than your run probably because of -O3)
The Haskell solution runs in 36 seconds (due to the -O2 flag)
Your factorCount' code isn't explicitly typed and defaulting to Integer (thanks to Daniel for correcting my misdiagnosis here!). Giving an explicit type signature (which is standard practice anyway) using Int and the time changes to 11.1 seconds
in factorCount' you have needlessly called fromIntegral. A fix results in no change though (the compiler is smart, lucky for you).
You used mod where rem is faster and sufficient. This changes the time to 8.5 seconds.
factorCount' is constantly applying two extra arguments that never change (number, sqrt). A worker/wrapper transformation gives us:

 $ time ./so
 842161320  

 real    0m7.954s  
 user    0m7.944s  
 sys     0m0.004s

That's right, 7.95 seconds. Consistently half a second faster than the C solution. Without the -fllvm flag I'm still getting 8.182 seconds, so the NCG backend is doing well in this case too.

Conclusion: Haskell is awesome.

Resulting Code

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' :: Int -> Int -> Int -> Int -> Int
factorCount' number sqrt candidate0 count0 = go candidate0 count0
  where
  go candidate count
    | candidate > sqrt = count
    | number `rem` candidate == 0 = go (candidate + 1) (count + 2)
    | otherwise = go (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

EDIT: So now that we've explored that, lets address the questions

Question 1: Do erlang, python and haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

In Haskell, using Integer is slower than Int but how much slower depends on the computations performed. Luckily (for 64 bit machines) Int is sufficient. For portability sake you should probably rewrite my code to use Int64 or Word64 (C isn't the only language with a long).

Question 2: Why is haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as haskell is a book with seven seals to me.)

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

That was what I answered above. The answer was

0) Use optimization via -O2
1) Use fast (notably: unbox-able) types when possible
2) rem not mod (a frequently forgotten optimization) and
3) worker/wrapper transformation (perhaps the most common optimization).

Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

Yes, that wasn't the issue. Good work and glad you considered this.

227

votes

There are some problems with the Erlang implementation. As baseline for the following, my measured execution time for your unmodified Erlang program was 47.6 seconds, compared to 12.7 seconds for the C code.

The first thing you should do if you want to run computationally intensive Erlang code is to use native code. Compiling with erlc +native euler12 got the time down to 41.3 seconds. This is however a much lower speedup (just 15%) than expected from native compilation on this kind of code, and the problem is your use of -compile(export_all). This is useful for experimentation, but the fact that all functions are potentially reachable from the outside causes the native compiler to be very conservative. (The normal BEAM emulator is not that much affected.) Replacing this declaration with -export([solve/0]). gives a much better speedup: 31.5 seconds (almost 35% from the baseline).

But the code itself has a problem: for each iteration in the factorCount loop, you perform this test:

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

The C code doesn't do this. In general, it can be tricky to make a fair comparison between different implementations of the same code, and in particular if the algorithm is numerical, because you need to be sure that they are actually doing the same thing. A slight rounding error in one implementation due to some typecast somewhere may cause it to do many more iterations than the other even though both eventually reach the same result.

To eliminate this possible error source (and get rid of the extra test in each iteration), I rewrote the factorCount function as follows, closely modelled on the C code:

factorCount (N) ->
    Sqrt = math:sqrt (N),
    ISqrt = trunc(Sqrt),
    if ISqrt == Sqrt -> factorCount (N, ISqrt, 1, -1);
       true          -> factorCount (N, ISqrt, 1, 0)
    end.

factorCount (_N, ISqrt, Candidate, Count) when Candidate > ISqrt -> Count;
factorCount ( N, ISqrt, Candidate, Count) ->
    case N rem Candidate of
        0 -> factorCount (N, ISqrt, Candidate + 1, Count + 2);
        _ -> factorCount (N, ISqrt, Candidate + 1, Count)
    end.

This rewrite, no export_all, and native compilation, gave me the following run time:

$ erlc +native euler12.erl
$ time erl -noshell -s euler12 solve
842161320

real    0m19.468s
user    0m19.450s
sys 0m0.010s

which is not too bad compared to the C code:

$ time ./a.out 
842161320

real    0m12.755s
user    0m12.730s
sys 0m0.020s

considering that Erlang is not at all geared towards writing numerical code, being only 50% slower than C on a program like this is pretty good.

Finally, regarding your questions:

Question 1: Do erlang, python and haskell loose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

Yes, somewhat. In Erlang, there is no way of saying "use 32/64-bit arithmetic with wrap-around", so unless the compiler can prove some bounds on your integers (and it usually can't), it must check all computations to see if they can fit in a single tagged word or if it has to turn them into heap-allocated bignums. Even if no bignums are ever used in practice at runtime, these checks will have to be performed. On the other hand, that means you know that the algorithm will never fail because of an unexpected integer wraparound if you suddenly give it larger inputs than before.

Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

Yes, your Erlang code is correct with respect to last call optimization.

159

votes

In regards to Python optimization, in addition to using PyPy (for pretty impressive speed-ups with zero change to your code), you could use PyPy's translation toolchain to compile an RPython-compliant version, or Cython to build an extension module, both of which are faster than the C version in my testing, with the Cython module nearly twice as fast. For reference I include C and PyPy benchmark results as well:

C (compiled with gcc -O3 -lm)

% time ./euler12-c 
842161320

./euler12-c  11.95s 
 user 0.00s 
 system 99% 
 cpu 11.959 total

PyPy 1.5

% time pypy euler12.py
842161320
pypy euler12.py  
16.44s user 
0.01s system 
99% cpu 16.449 total

RPython (using latest PyPy revision, c2f583445aee)

% time ./euler12-rpython-c
842161320
./euler12-rpy-c  
10.54s user 0.00s 
system 99% 
cpu 10.540 total

Cython 0.15

% time python euler12-cython.py
842161320
python euler12-cython.py  
6.27s user 0.00s 
system 99% 
cpu 6.274 total

The RPython version has a couple of key changes. To translate into a standalone program you need to define your target, which in this case is the main function. It's expected to accept sys.argv as it's only argument, and is required to return an int. You can translate it by using translate.py, % translate.py euler12-rpython.py which translates to C and compiles it for you.

# euler12-rpython.py

import math, sys

def factorCount(n):
    square = math.sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in xrange(1, isquare + 1):
        if not n % candidate: count += 2
    return count

def main(argv):
    triangle = 1
    index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle
    return 0

if __name__ == '__main__':
    main(sys.argv)

def target(*args):
    return main, None

The Cython version was rewritten as an extension module _euler12.pyx, which I import and call from a normal python file. The _euler12.pyx is essentially the same as your version, with some additional static type declarations. The setup.py has the normal boilerplate to build the extension, using python setup.py build_ext --inplace.

# _euler12.pyx
from libc.math cimport sqrt

cdef int factorCount(int n):
    cdef int candidate, isquare, count
    cdef double square
    square = sqrt(n)
    isquare = int(square)
    count = -1 if isquare == square else 0
    for candidate in range(1, isquare + 1):
        if not n % candidate: count += 2
    return count

cpdef main():
    cdef int triangle = 1, index = 1
    while factorCount(triangle) < 1001:
        index += 1
        triangle += index
    print triangle

# euler12-cython.py
import _euler12
_euler12.main()

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("_euler12", ["_euler12.pyx"])]

setup(
  name = 'Euler12-Cython',
  cmdclass = {'build_ext': build_ext},
  ext_modules = ext_modules
)

I honestly have very little experience with either RPython or Cython, and was pleasantly surprised at the results. If you are using CPython, writing your CPU-intensive bits of code in a Cython extension module seems like a really easy way to optimize your program.

votes

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

The C implementation is suboptimal (as hinted at by Thomas M. DuBuisson), the version uses 64-bit integers (i.e. long datatype). I'll investigate the assembly listing later, but with an educated guess, there are some memory accesses going on in the compiled code, which make using 64-bit integers significantly slower. It's that or generated code (be it the fact that you can fit less 64-bit ints in a SSE register or round a double to a 64-bit integer is slower).

Here is the modified code (simply replace long with int and I explicitly inlined factorCount, although I do not think that this is necessary with gcc -O3):

#include <stdio.h>
#include <math.h>

static inline int factorCount(int n)
{
    double square = sqrt (n);
    int isquare = (int)square;
    int count = isquare == square ? -1 : 0;
    int candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    int triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index++;
        triangle += index;
    }
    printf ("%d\n", triangle);
}

Running + timing it gives:

$ gcc -O3 -lm -o euler12 euler12.c; time ./euler12
842161320
./euler12  2.95s user 0.00s system 99% cpu 2.956 total

For reference, the haskell implementation by Thomas in the earlier answer gives:

$ ghc -O2 -fllvm -fforce-recomp euler12.hs; time ./euler12                                                                                      [9:40]
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12 ...
842161320
./euler12  9.43s user 0.13s system 99% cpu 9.602 total

Conclusion: Taking nothing away from ghc, its a great compiler, but gcc normally generates faster code.

votes

Take a look at this blog. Over the past year or so he's done a few of the Project Euler problems in Haskell and Python, and he's generally found Haskell to be much faster. I think that between those languages it has more to do with your fluency and coding style.

When it comes to Python speed, you're using the wrong implementation! Try PyPy, and for things like this you'll find it to be much, much faster.

votes

Your Haskell implementation could be greatly sped up by using some functions from Haskell packages. In this case I used primes, which is just installed with 'cabal install primes' ;)

import Data.Numbers.Primes
import Data.List

triangleNumbers = scanl1 (+) [1..]
nDivisors n = product $ map ((+1) . length) (group (primeFactors n))
answer = head $ filter ((> 500) . nDivisors) triangleNumbers

main :: IO ()
main = putStrLn $ "First triangle number to have over 500 divisors: " ++ (show answer)

Timings:

Your original program:

PS> measure-command { bin\012_slow.exe }

TotalSeconds      : 16.3807409
TotalMilliseconds : 16380.7409

Improved implementation

PS> measure-command { bin\012.exe }

TotalSeconds      : 0.0383436
TotalMilliseconds : 38.3436

As you can see, this one runs in 38 milliseconds on the same machine where yours ran in 16 seconds :)

Compilation commands:

ghc -O2 012.hs -o bin\012.exe
ghc -O2 012_slow.hs -o bin\012_slow.exe

votes

Just for fun. The following is a more 'native' Haskell implementation:

import Control.Applicative
import Control.Monad
import Data.Either
import Math.NumberTheory.Powers.Squares

isInt :: RealFrac c => c -> Bool
isInt = (==) <$> id <*> fromInteger . round

intSqrt :: (Integral a) => a -> Int
--intSqrt = fromIntegral . floor . sqrt . fromIntegral
intSqrt = fromIntegral . integerSquareRoot'

factorize :: Int -> [Int]
factorize 1 = []
factorize n = first : factorize (quot n first)
  where first = (!! 0) $ [a | a <- [2..intSqrt n], rem n a == 0] ++ [n]

factorize2 :: Int -> [(Int,Int)]
factorize2 = foldl (\ls@((val,freq):xs) y -> if val == y then (val,freq+1):xs else (y,1):ls) [(0,0)] . factorize

numDivisors :: Int -> Int
numDivisors = foldl (\acc (_,y) -> acc * (y+1)) 1 <$> factorize2

nextTriangleNumber :: (Int,Int) -> (Int,Int)
nextTriangleNumber (n,acc) = (n+1,acc+n+1)

forward :: Int -> (Int, Int) -> Either (Int, Int) (Int, Int)
forward k val@(n,acc) = if numDivisors acc > k then Left val else Right (nextTriangleNumber val)

problem12 :: Int -> (Int, Int)
problem12 n = (!!0) . lefts . scanl (>>=) (forward n (1,1)) . repeat . forward $ n

main = do
  let (n,val) = problem12 1000
  print val

Using ghc -O3, this consistently runs in 0.55-0.58 seconds on my machine (1.73GHz Core i7).

A more efficient factorCount function for the C version:

int factorCount (int n)
{
  int count = 1;
  int candidate,tmpCount;
  while (n % 2 == 0) {
    count++;
    n /= 2;
  }
    for (candidate = 3; candidate < n && candidate * candidate < n; candidate += 2)
    if (n % candidate == 0) {
      tmpCount = 1;
      do {
        tmpCount++;
        n /= candidate;
      } while (n % candidate == 0);
       count*=tmpCount;
      }
  if (n > 1)
    count *= 2;
  return count;
}

Changing longs to ints in main, using gcc -O3 -lm, this consistently runs in 0.31-0.35 seconds.

Both can be made to run even faster if you take advantage of the fact that the nth triangle number = n*(n+1)/2, and n and (n+1) have completely disparate prime factorizations, so the number of factors of each half can be multiplied to find the number of factors of the whole. The following:

int main ()
{
  int triangle = 0,count1,count2 = 1;
  do {
    count1 = count2;
    count2 = ++triangle % 2 == 0 ? factorCount(triangle+1) : factorCount((triangle+1)/2);
  } while (count1*count2 < 1001);
  printf ("%lld\n", ((long long)triangle)*(triangle+1)/2);
}

will reduce the c code run time to 0.17-0.19 seconds, and it can handle much larger searches -- greater than 10000 factors takes about 43 seconds on my machine. I leave a similar haskell speedup to the interested reader.

votes

Question 1: Do erlang, python and haskell loose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

This is unlikely. I cannot say much about Erlang and Haskell (well, maybe a bit about Haskell below) but I can point a lot of other bottlenecks in Python. Every time the program tries to execute an operation with some values in Python, it should verify whether the values are from the proper type, and it costs a bit of time. Your factorCount function just allocates a list with range (1, isquare + 1) various times, and runtime, malloc-styled memory allocation is way slower than iterating on a range with a counter as you do in C. Notably, the factorCount() is called multiple times and so allocates a lot of lists. Also, let us not forget that Python is interpreted and the CPython interpreter has no great focus on being optimized.

EDIT: oh, well, I note that you are using Python 3 so range() does not return a list, but a generator. In this case, my point about allocating lists is half-wrong: the function just allocates range objects, which are inefficient nonetheless but not as inefficient as allocating a list with a lot of items.

Question 2: Why is haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as haskell is a book with seven seals to me.)

Are you using Hugs? Hugs is a considerably slow interpreter. If you are using it, maybe you can get a better time with GHC - but I am only cogitating hypotesis, the kind of stuff a good Haskell compiler does under the hood is pretty fascinating and way beyond my comprehension :)

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

I'd say you are playing an unfunny game. The best part of knowing various languages is to use them the most different way possible :) But I digress, I just do not have any recommendation for this point. Sorry, I hope someone can help you in this case :)

Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

As far as I remember, you just need to make sure that your recursive call is the last command before returning a value. In other words, a function like the one below could use such optimization:

def factorial(n, acc=1):
    if n > 1:
        acc = acc * n
        n = n - 1
        return factorial(n, acc)
    else:
        return acc

However, you would not have such optimization if your function were such as the one below, because there is an operation (multiplication) after the recursive call:

def factorial2(n):
    if n > 1:
        f = factorial2(n-1)
        return f*n
    else:
        return 1

I separated the operations in some local variables for make it clear which operations are executed. However, the most usual is to see these functions as below, but they are equivalent for the point I am making:

def factorial(n, acc=1):
    if n > 1:
        return factorial(n-1, acc*n)
    else:
        return acc

def factorial2(n):
    if n > 1:
        return n*factorial(n-1)
    else:
        return 1

Note that it is up to the compiler/interpreter to decide if it will make tail recursion. For example, the Python interpreter does not do it if I remember well (I used Python in my example only because of its fluent syntax). Anyway, if you find strange stuff such as factorial functions with two parameters (and one of the parameters has names such as acc, accumulator etc.) now you know why people do it :)

votes

With Haskell, you really don't need to think in recursions explicitly.

factorCount number = foldr factorCount' 0 [1..isquare] -
                     (fromEnum $ square == fromIntegral isquare)
    where
      square = sqrt $ fromIntegral number
      isquare = floor square
      factorCount' candidate
        | number `rem` candidate == 0 = (2 +)
        | otherwise = id

triangles :: [Int]
triangles = scanl1 (+) [1,2..]

main = print . head $ dropWhile ((< 1001) . factorCount) triangles

In the above code, I have replaced explicit recursions in @Thomas' answer with common list operations. The code still does exactly the same thing without us worrying about tail recursion. It runs (~ 7.49s) about 6% slower than the version in @Thomas' answer (~ 7.04s) on my machine with GHC 7.6.2, while the C version from @Raedwulf runs ~ 3.15s. It seems GHC has improved over the year.

PS. I know it is an old question, and I stumble upon it from google searches (I forgot what I was searching, now...). Just wanted to comment on the question about LCO and express my feelings about Haskell in general. I wanted to comment on the top answer, but comments do not allow code blocks.

votes

Looking at your Erlang implementation. The timing has included the start up of the entire virtual machine, running your program and halting the virtual machine. Am pretty sure that setting up and halting the erlang vm takes some time.

If the timing was done within the erlang virtual machine itself, results would be different as in that case we would have the actual time for only the program in question. Otherwise, i believe that the total time taken by the process of starting and loading of the Erlang Vm plus that of halting it (as you put it in your program) are all included in the total time which the method you are using to time the program is outputting. Consider using the erlang timing itself which we use when we want to time our programs within the virtual machine itself timer:tc/1 or timer:tc/2 or timer:tc/3. In this way, the results from erlang will exclude the time taken to start and stop/kill/halt the virtual machine. That is my reasoning there, think about it, and then try your bench mark again.

I actually suggest that we try to time the program (for languages that have a runtime), within the runtime of those languages in order to get a precise value. C for example has no overhead of starting and shutting down a runtime system as does Erlang, Python and Haskell (98% sure of this - i stand correction). So (based on this reasoning) i conclude by saying that this benchmark wasnot precise /fair enough for languages running on top of a runtime system. Lets do it again with these changes.

EDIT: besides even if all the languages had runtime systems, the overhead of starting each and halting it would differ. so i suggest we time from within the runtime systems (for the languages for which this applies). The Erlang VM is known to have considerable overhead at start up!

votes

Some more numbers and explanations for the C version. Apparently noone did it during all those years. Remember to upvote this answer so it can get on top for everyone to see and learn.

Step One: Benchmark of the author's programs

Laptop Specifications:

CPU i3 M380 (931 MHz - maximum battery saving mode)
4GB memory
Win7 64 bits
Microsoft Visual Studio 2012 Ultimate
Cygwin with gcc 4.9.3
Python 2.7.10

Commands:

compiling on VS x64 command prompt > `for /f %f in ('dir /b *.c') do cl /O2 /Ot /Ox %f -o %f_x64_vs2012.exe`
compiling on cygwin with gcc x64   > `for f in ./*.c; do gcc -m64 -O3 $f -o ${f}_x64_gcc.exe ; done`
time (unix tools) using cygwin > `for f in ./*.exe; do  echo "----------"; echo $f ; time $f ; done`

----------
$ time python ./original.py

real    2m17.748s
user    2m15.783s
sys     0m0.093s
----------
$ time ./original_x86_vs2012.exe

real    0m8.377s
user    0m0.015s
sys     0m0.000s
----------
$ time ./original_x64_vs2012.exe

real    0m8.408s
user    0m0.000s
sys     0m0.015s
----------
$ time ./original_x64_gcc.exe

real    0m20.951s
user    0m20.732s
sys     0m0.030s

Filenames are: integertype_architecture_compiler.exe

integertype is the same as the original program for now (more on that later)
architecture is x86 or x64 depending on the compiler settings
compiler is gcc or vs2012

Step Two: Investigate, Improve and Benchmark Again

VS is 250% faster than gcc. The two compilers should give a similar speed. Obviously, something is wrong with the code or the compiler options. Let's investigate!

The first point of interest is the integer types. Conversions can be expensive and consistency is important for better code generation & optimizations. All integers should be the same type.

It's a mixed mess of int and long right now. We're going to improve that. What type to use? The fastest. Gotta benchmark them'all!

----------
$ time ./int_x86_vs2012.exe

real    0m8.440s
user    0m0.016s
sys     0m0.015s
----------
$ time ./int_x64_vs2012.exe

real    0m8.408s
user    0m0.016s
sys     0m0.015s
----------
$ time ./int32_x86_vs2012.exe

real    0m8.408s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int32_x64_vs2012.exe

real    0m8.362s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int64_x86_vs2012.exe

real    0m18.112s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int64_x64_vs2012.exe

real    0m18.611s
user    0m0.000s
sys     0m0.015s
----------
$ time ./long_x86_vs2012.exe

real    0m8.393s
user    0m0.015s
sys     0m0.000s
----------
$ time ./long_x64_vs2012.exe

real    0m8.440s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint32_x86_vs2012.exe

real    0m8.362s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint32_x64_vs2012.exe

real    0m8.393s
user    0m0.015s
sys     0m0.015s
----------
$ time ./uint64_x86_vs2012.exe

real    0m15.428s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint64_x64_vs2012.exe

real    0m15.725s
user    0m0.015s
sys     0m0.015s
----------
$ time ./int_x64_gcc.exe

real    0m8.531s
user    0m8.329s
sys     0m0.015s
----------
$ time ./int32_x64_gcc.exe

real    0m8.471s
user    0m8.345s
sys     0m0.000s
----------
$ time ./int64_x64_gcc.exe

real    0m20.264s
user    0m20.186s
sys     0m0.015s
----------
$ time ./long_x64_gcc.exe

real    0m20.935s
user    0m20.809s
sys     0m0.015s
----------
$ time ./uint32_x64_gcc.exe

real    0m8.393s
user    0m8.346s
sys     0m0.015s
----------
$ time ./uint64_x64_gcc.exe

real    0m16.973s
user    0m16.879s
sys     0m0.030s

Integer types are int long int32_t uint32_t int64_t and uint64_t from #include <stdint.h>

There are LOTS of integer types in C, plus some signed/unsigned to play with, plus the choice to compile as x86 or x64 (not to be confused with the actual integer size). That is a lot of versions to compile and run ^^

Step Three: Understanding the Numbers

Definitive conclusions:

32 bits integers are ~200% faster than 64 bits equivalents
unsigned 64 bits integers are 25 % faster than signed 64 bits (Unfortunately, I have no explanation for that)

Trick question: "What are the sizes of int and long in C?"
The right answer is: The size of int and long in C are not well-defined!

From the C spec:

int is at least 32 bits
long is at least an int

From the gcc man page (-m32 and -m64 flags):

The 32-bit environment sets int, long and pointer to 32 bits and generates code that runs on any i386 system.
The 64-bit environment sets int to 32 bits and long and pointer to 64 bits and generates code for AMD’s x86-64 architecture.

From MSDN documentation (Data Type Ranges) https://msdn.microsoft.com/en-us/library/s3f49ktz%28v=vs.110%29.aspx :

int, 4 bytes, also knows as signed
long, 4 bytes, also knows as long int and signed long int

To Conclude This: Lessons Learned

32 bits integers are faster than 64 bits integers.
Standard integers types are not well defined in C nor C++, they vary depending on compilers and architectures. When you need consistency and predictability, use the uint32_t integer family from #include <stdint.h>.
Speed issues solved. All other languages are back hundreds percent behind, C & C++ win again ! They always do. The next improvement will be multithreading using OpenMP :D

votes

Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

Question one can be answered in the negative for Erlang. The last question is answered by using Erlang appropriately, as in:

http://bredsaal.dk/learning-erlang-using-projecteuler-net

Since it's faster than your initial C example, I would guess there are numerous problems as others have already covered in detail.

This Erlang module executes on a cheap netbook in about 5 seconds ... It uses the network threads model in erlang and, as such demonstrates how to take advantage of the event model. It could be distributed over many nodes. And it's fast. Not my code.

-module(p12dist).  
-author("Jannich Brendle, [email protected], http://blog.bredsaal.dk").  
-compile(export_all).

server() ->  
  server(1).

server(Number) ->  
  receive {getwork, Worker_PID} -> Worker_PID ! {work,Number,Number+100},  
  server(Number+101);  
  {result,T} -> io:format("The result is: \~w.\~n", [T]);  
  _ -> server(Number)  
  end.

worker(Server_PID) ->  
  Server_PID ! {getwork, self()},  
  receive {work,Start,End} -> solve(Start,End,Server_PID)  
  end,  
  worker(Server_PID).

start() ->  
  Server_PID = spawn(p12dist, server, []),  
  spawn(p12dist, worker, [Server_PID]),  
  spawn(p12dist, worker, [Server_PID]),  
  spawn(p12dist, worker, [Server_PID]),  
  spawn(p12dist, worker, [Server_PID]).

solve(N,End,_) when N =:= End -> no_solution;

solve(N,End,Server_PID) ->  
  T=round(N*(N+1)/2),
  case (divisor(T,round(math:sqrt(T))) > 500) of  
    true ->  
      Server_PID ! {result,T};  
    false ->  
      solve(N+1,End,Server_PID)  
  end.

divisors(N) ->  
  divisor(N,round(math:sqrt(N))).

divisor(_,0) -> 1;  
divisor(N,I) ->  
  case (N rem I) =:= 0 of  
  true ->  
    2+divisor(N,I-1);  
  false ->  
    divisor(N,I-1)  
  end.

The test below took place on an: Intel(R) Atom(TM) CPU N270 @ 1.60GHz

~$ time erl -noshell -s p12dist start

The result is: 76576500.

^C

BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
a

real    0m5.510s
user    0m5.836s
sys 0m0.152s

votes

Trying GO:

package main

import "fmt"
import "math"

func main() {
    var n, m, c int
    for i := 1; ; i++ {
        n, m, c = i * (i + 1) / 2, int(math.Sqrt(float64(n))), 0
        for f := 1; f < m; f++ {
            if n % f == 0 { c++ }
    }
    c *= 2
    if m * m == n { c ++ }
    if c > 1001 {
        fmt.Println(n)
        break
        }
    }
}

I get:

original c version: 9.1690 100%
go: 8.2520 111%

But using:

package main

import (
    "math"
    "fmt"
 )

// Sieve of Eratosthenes
func PrimesBelow(limit int) []int {
    switch {
        case limit < 2:
            return []int{}
        case limit == 2:
            return []int{2}
    }
    sievebound := (limit - 1) / 2
    sieve := make([]bool, sievebound+1)
    crosslimit := int(math.Sqrt(float64(limit))-1) / 2
    for i := 1; i <= crosslimit; i++ {
        if !sieve[i] {
            for j := 2 * i * (i + 1); j <= sievebound; j += 2*i + 1 {
                sieve[j] = true
            }
        }
    }
    plimit := int(1.3*float64(limit)) / int(math.Log(float64(limit)))
    primes := make([]int, plimit)
    p := 1
    primes[0] = 2
    for i := 1; i <= sievebound; i++ {
        if !sieve[i] {
            primes[p] = 2*i + 1
            p++
            if p >= plimit {
                break
            }
        }
    }
    last := len(primes) - 1
    for i := last; i > 0; i-- {
        if primes[i] != 0 {
            break
        }
        last = i
    }
    return primes[0:last]
}



func main() {
    fmt.Println(p12())
}
// Requires PrimesBelow from utils.go
func p12() int {
    n, dn, cnt := 3, 2, 0
    primearray := PrimesBelow(1000000)
    for cnt <= 1001 {
        n++
        n1 := n
        if n1%2 == 0 {
            n1 /= 2
        }
        dn1 := 1
        for i := 0; i < len(primearray); i++ {
            if primearray[i]*primearray[i] > n1 {
                dn1 *= 2
                break
            }
            exponent := 1
            for n1%primearray[i] == 0 {
                exponent++
                n1 /= primearray[i]
            }
            if exponent > 1 {
                dn1 *= exponent
            }
            if n1 == 1 {
                break
            }
        }
        cnt = dn * dn1
        dn = dn1
    }
    return n * (n - 1) / 2
}

I get:

original c version: 9.1690 100%
thaumkid's c version: 0.1060 8650%
first go version: 8.2520 111%
second go version: 0.0230 39865%

I also tried Python3.6 and pypy3.3-5.5-alpha:

original c version: 8.629 100%
thaumkid's c version: 0.109 7916%
Python3.6: 54.795 16%
pypy3.3-5.5-alpha: 13.291 65%

and then with following code I got:

original c version: 8.629 100%
thaumkid's c version: 0.109 8650%
Python3.6: 1.489 580%
pypy3.3-5.5-alpha: 0.582 1483%

def D(N):
    if N == 1: return 1
    sqrtN = int(N ** 0.5)
    nf = 1
    for d in range(2, sqrtN + 1):
        if N % d == 0:
            nf = nf + 1
    return 2 * nf - (1 if sqrtN**2 == N else 0)

L = 1000
Dt, n = 0, 0

while Dt <= L:
    t = n * (n + 1) // 2
    Dt = D(n/2)*D(n+1) if n%2 == 0 else D(n)*D((n+1)/2)
    n = n + 1

print (t)

votes

C++11, < 20ms for me - Run it here

I understand that you want tips to help improve your language specific knowledge, but since that has been well covered here, I thought I would add some context for people who may have looked at the mathematica comment on your question, etc, and wondered why this code was so much slower.

This answer is mainly to provide context to hopefully help people evaluate the code in your question / other answers more easily.

This code uses only a couple of (uglyish) optimisations, unrelated to the language used, based on:

every traingle number is of the form n(n+1)/2
n and n+1 are coprime
the number of divisors is a multiplicative function

#include <iostream>
#include <cmath>
#include <tuple>
#include <chrono>

using namespace std;

// Calculates the divisors of an integer by determining its prime factorisation.

int get_divisors(long long n)
{
    int divisors_count = 1;

    for(long long i = 2;
        i <= sqrt(n);
        /* empty */)
    {
        int divisions = 0;
        while(n % i == 0)
        {
            n /= i;
            divisions++;
        }

        divisors_count *= (divisions + 1);

        //here, we try to iterate more efficiently by skipping
        //obvious non-primes like 4, 6, etc
        if(i == 2)
            i++;
        else
            i += 2;
    }

    if(n != 1) //n is a prime
        return divisors_count * 2;
    else
        return divisors_count;
}

long long euler12()
{
    //n and n + 1
    long long n, n_p_1;

    n = 1; n_p_1 = 2;

    // divisors_x will store either the divisors of x or x/2
    // (the later iff x is divisible by two)
    long long divisors_n = 1;
    long long divisors_n_p_1 = 2;

    for(;;)
    {
        /* This loop has been unwound, so two iterations are completed at a time
         * n and n + 1 have no prime factors in common and therefore we can
         * calculate their divisors separately
         */

        long long total_divisors;                 //the divisors of the triangle number
                                                  // n(n+1)/2

        //the first (unwound) iteration

        divisors_n_p_1 = get_divisors(n_p_1 / 2); //here n+1 is even and we

        total_divisors =
                  divisors_n
                * divisors_n_p_1;

        if(total_divisors > 1000)
            break;

        //move n and n+1 forward
        n = n_p_1;
        n_p_1 = n + 1;

        //fix the divisors
        divisors_n = divisors_n_p_1;
        divisors_n_p_1 = get_divisors(n_p_1);   //n_p_1 is now odd!

        //now the second (unwound) iteration

        total_divisors =
                  divisors_n
                * divisors_n_p_1;

        if(total_divisors > 1000)
            break;

        //move n and n+1 forward
        n = n_p_1;
        n_p_1 = n + 1;

        //fix the divisors
        divisors_n = divisors_n_p_1;
        divisors_n_p_1 = get_divisors(n_p_1 / 2);   //n_p_1 is now even!
    }

    return (n * n_p_1) / 2;
}

int main()
{
    for(int i = 0; i < 1000; i++)
    {
        using namespace std::chrono;
        auto start = high_resolution_clock::now();
        auto result = euler12();
        auto end = high_resolution_clock::now();

        double time_elapsed = duration_cast<milliseconds>(end - start).count();

        cout << result << " " << time_elapsed << '\n';
    }
    return 0;
}

That takes around 19ms on average for my desktop and 80ms for my laptop, a far cry from most of the other code I've seen here. And there are, no doubt, many optimisations still available.

votes

Change: case (divisor(T,round(math:sqrt(T))) > 500) of

To: case (divisor(T,round(math:sqrt(T))) > 1000) of

This will produce the correct answer for the Erlang multi-process example.

votes

I made the assumption that the number of factors is only large if the numbers involved have many small factors. So I used thaumkid's excellent algorithm, but first used an approximation to the factor count that is never too small. It's quite simple: Check for prime factors up to 29, then check the remaining number and calculate an upper bound for the nmber of factors. Use this to calculate an upper bound for the number of factors, and if that number is high enough, calculate the exact number of factors.

The code below doesn't need this assumption for correctness, but to be fast. It seems to work; only about one in 100,000 numbers gives an estimate that is high enough to require a full check.

Here's the code:

// Return at least the number of factors of n.
static uint64_t approxfactorcount (uint64_t n)
{
    uint64_t count = 1, add;

#define CHECK(d)                            \
    do {                                    \
        if (n % d == 0) {                   \
            add = count;                    \
            do { n /= d; count += add; }    \
            while (n % d == 0);             \
        }                                   \
    } while (0)

    CHECK ( 2); CHECK ( 3); CHECK ( 5); CHECK ( 7); CHECK (11); CHECK (13);
    CHECK (17); CHECK (19); CHECK (23); CHECK (29);
    if (n == 1) return count;
    if (n < 1ull * 31 * 31) return count * 2;
    if (n < 1ull * 31 * 31 * 37) return count * 4;
    if (n < 1ull * 31 * 31 * 37 * 37) return count * 8;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41) return count * 16;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43) return count * 32;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47) return count * 64;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53) return count * 128;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59) return count * 256;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61) return count * 512;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67) return count * 1024;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67 * 71) return count * 2048;
    if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67 * 71 * 73) return count * 4096;
    return count * 1000000;
}

// Return the number of factors of n.
static uint64_t factorcount (uint64_t n)
{
    uint64_t count = 1, add;

    CHECK (2); CHECK (3);

    uint64_t d = 5, inc = 2;
    for (; d*d <= n; d += inc, inc = (6 - inc))
        CHECK (d);

    if (n > 1) count *= 2; // n must be a prime number
    return count;
}

// Prints triangular numbers with record numbers of factors.
static void printrecordnumbers (uint64_t limit)
{
    uint64_t record = 30000;

    uint64_t count1, factor1;
    uint64_t count2 = 1, factor2 = 1;

    for (uint64_t n = 1; n <= limit; ++n)
    {
        factor1 = factor2;
        count1 = count2;

        factor2 = n + 1; if (factor2 % 2 == 0) factor2 /= 2;
        count2 = approxfactorcount (factor2);

        if (count1 * count2 > record)
        {
            uint64_t factors = factorcount (factor1) * factorcount (factor2);
            if (factors > record)
            {
                printf ("%lluth triangular number = %llu has %llu factors\n", n, factor1 * factor2, factors);
                record = factors;
            }
        }
    }
}

This finds the 14,753,024th triangular with 13824 factors in about 0.7 seconds, the 879,207,615th triangular number with 61,440 factors in 34 seconds, the 12,524,486,975th triangular number with 138,240 factors in 10 minutes 5 seconds, and the 26,467,792,064th triangular number with 172,032 factors in 21 minutes 25 seconds (2.4GHz Core2 Duo), so this code takes only 116 processor cycles per number on average. The last triangular number itself is larger than 2^68, so

votes

I modified "Jannich Brendle" version to 1000 instead 500. And list the result of euler12.bin, euler12.erl, p12dist.erl. Both erl codes use '+native' to compile.

zhengs-MacBook-Pro:workspace zhengzhibin$ time erl -noshell -s p12dist start
The result is: 842161320.

real    0m3.879s
user    0m14.553s
sys     0m0.314s
zhengs-MacBook-Pro:workspace zhengzhibin$ time erl -noshell -s euler12 solve
842161320

real    0m10.125s
user    0m10.078s
sys     0m0.046s
zhengs-MacBook-Pro:workspace zhengzhibin$ time ./euler12.bin 
842161320

real    0m5.370s
user    0m5.328s
sys     0m0.004s
zhengs-MacBook-Pro:workspace zhengzhibin$

votes

#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square+1;
    long candidate = 2;
    int count = 1;
    while(candidate <= isquare && candidate <= n){
        int c = 1;
        while (n % candidate == 0) {
           c++;
           n /= candidate;
        }
        count *= c;
        candidate++;
    }
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}

gcc -lm -Ofast euler.c

time ./a.out

2.79s user 0.00s system 99% cpu 2.794 total