215
votes

I've been wondering this for some time. As the title say, which is faster, the actual function or simply raising to the half power?

UPDATE

This is not a matter of premature optimization. This is simply a question of how the underlying code actually works. What is the theory of how Python code works?

I sent Guido van Rossum an email cause I really wanted to know the differences in these methods.

My email:

There are at least 3 ways to do a square root in Python: math.sqrt, the '**' operator and pow(x,.5). I'm just curious as to the differences in the implementation of each of these. When it comes to efficiency which is better?

His response:

pow and ** are equivalent; math.sqrt doesn't work for complex numbers, and links to the C sqrt() function. As to which one is faster, I have no idea...

14
That's awesome the Guido responds to email. - Evan Fosmark
Evan, I was surprised I got a response - Nope
I don't think this is a bad question. For example, x * x is a full 10 times faster than x ** 2. Readability is a tossup in this situation, so why not do the fast way? - TM.
Casey, I'm with you on the "premature optimization" thing. :) Your question does not look like premature optimization to me: there is no risk that any of the variants breaks your code. It's more a matter of knowing better what you do (in terms of execution time) when you choose pow() over math.sqrt(). - Eric O Lebigot
This isn't premature optimization, but rather avoiding premature pessimization (ref. no. 28, C++ coding standards, A.Alexandrescu). If math.sqrt is a more optimized routine (as it is) and expresses the intent more clearly, it should always be preferred over x**.5. It is not premature optimization to know what you write, and chose the alternative that is faster and provides more code clarity. If so, you need to argue equally well why you would chose the other alternatives. - swalog

14 Answers

99
votes

math.sqrt(x) is significantly faster than x**0.5.

import math
N = 1000000
%%timeit
for i in range(N):
    z=i**.5

10 loops, best of 3: 156 ms per loop

%%timeit
for i in range(N):
    z=math.sqrt(i)

10 loops, best of 3: 91.1 ms per loop

Using Python 3.6.9 (notebook).

20
votes
  • first rule of optimization: don't do it
  • second rule: don't do it, yet

Here's some timings (Python 2.5.2, Windows):

$ python -mtimeit -s"from math import sqrt; x = 123" "x**.5"
1000000 loops, best of 3: 0.445 usec per loop

$ python -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
1000000 loops, best of 3: 0.574 usec per loop

$ python -mtimeit -s"import math; x = 123" "math.sqrt(x)"
1000000 loops, best of 3: 0.727 usec per loop

This test shows that x**.5 is slightly faster than sqrt(x).

For the Python 3.0 the result is the opposite:

$ \Python30\python -mtimeit -s"from math import sqrt; x = 123" "x**.5"
1000000 loops, best of 3: 0.803 usec per loop

$ \Python30\python -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
1000000 loops, best of 3: 0.695 usec per loop

$ \Python30\python -mtimeit -s"import math; x = 123" "math.sqrt(x)"
1000000 loops, best of 3: 0.761 usec per loop

math.sqrt(x) is always faster than x**.5 on another machine (Ubuntu, Python 2.6 and 3.1):

$ python -mtimeit -s"from math import sqrt; x = 123" "x**.5"
10000000 loops, best of 3: 0.173 usec per loop
$ python -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
10000000 loops, best of 3: 0.115 usec per loop
$ python -mtimeit -s"import math; x = 123" "math.sqrt(x)"
10000000 loops, best of 3: 0.158 usec per loop
$ python3.1 -mtimeit -s"from math import sqrt; x = 123" "x**.5"
10000000 loops, best of 3: 0.194 usec per loop
$ python3.1 -mtimeit -s"from math import sqrt; x = 123" "sqrt(x)"
10000000 loops, best of 3: 0.123 usec per loop
$ python3.1 -mtimeit -s"import math; x = 123" "math.sqrt(x)"
10000000 loops, best of 3: 0.157 usec per loop
12
votes

In these micro-benchmarks, math.sqrt will be slower, because of the slight time it takes to lookup the sqrt in the math namespace. You can improve it slightly with

 from math import sqrt

Even then though, running a few variations through timeit, show a slight (4-5%) performance advantage for x**.5

Interestingly, doing

 import math
 sqrt = math.sqrt

sped it up even more, to within 1% difference in speed, with very little statistical significance.


I will repeat Kibbee, and say that this is probably a premature optimization.

9
votes

How many square roots are you really performing? Are you trying to write some 3D graphics engine in Python? If not, then why go with code which is cryptic over code that is easy to read? The time difference is would be less than anybody could notice in just about any application I could forsee. I really don't mean to put down your question, but it seems that you're going a little too far with premature optimization.

7
votes

In python 2.6 the (float).__pow__() function uses the C pow() function and the math.sqrt() functions uses the C sqrt() function.

In glibc compiler the implementation of pow(x,y) is quite complex and it is well optimized for various exceptional cases. For example, calling C pow(x,0.5) simply calls the sqrt() function.

The difference in speed of using .** or math.sqrt is caused by the wrappers used around the C functions and the speed strongly depends on optimization flags/C compiler used on the system.

Edit:

Here are the results of Claudiu's algorithm on my machine. I got different results:

zoltan@host:~$ python2.4 p.py 
Took 0.173994 seconds
Took 0.158991 seconds
zoltan@host:~$ python2.5 p.py 
Took 0.182321 seconds
Took 0.155394 seconds
zoltan@host:~$ python2.6 p.py 
Took 0.166766 seconds
Took 0.097018 seconds
4
votes

For what it's worth (see Jim's answer). On my machine, running python 2.5:

PS C:\> python -m timeit -n 100000 10000**.5
100000 loops, best of 3: 0.0543 usec per loop
PS C:\> python -m timeit -n 100000 -s "import math" math.sqrt(10000)
100000 loops, best of 3: 0.162 usec per loop
PS C:\> python -m timeit -n 100000 -s "from math import sqrt" sqrt(10000)
100000 loops, best of 3: 0.0541 usec per loop
4
votes

using Claudiu's code, on my machine even with "from math import sqrt" x**.5 is faster but using psyco.full() sqrt(x) becomes much faster, at least by 200%

3
votes

Most likely math.sqrt(x), because it's optimized for square rooting.

Benchmarks will provide you the answer you are looking for.

3
votes

Someone commented about the "fast Newton-Raphson square root" from Quake 3... I implemented it with ctypes, but it's super slow in comparison to the native versions. I'm going to try a few optimizations and alternate implementations.

from ctypes import c_float, c_long, byref, POINTER, cast

def sqrt(num):
 xhalf = 0.5*num
 x = c_float(num)
 i = cast(byref(x), POINTER(c_long)).contents.value
 i = c_long(0x5f375a86 - (i>>1))
 x = cast(byref(i), POINTER(c_float)).contents.value

 x = x*(1.5-xhalf*x*x)
 x = x*(1.5-xhalf*x*x)
 return x * num

Here's another method using struct, comes out about 3.6x faster than the ctypes version, but still 1/10 the speed of C.

from struct import pack, unpack

def sqrt_struct(num):
 xhalf = 0.5*num
 i = unpack('L', pack('f', 28.0))[0]
 i = 0x5f375a86 - (i>>1)
 x = unpack('f', pack('L', i))[0]

 x = x*(1.5-xhalf*x*x)
 x = x*(1.5-xhalf*x*x)
 return x * num
2
votes

The Pythonic thing to optimize for is readability. For this I think explicit use of the sqrt function is best. Having said that, let's investigate performance anyway.

I updated Claudiu's code for Python 3 and also made it impossible to optimize away the calculations (something a good Python compiler may do in the future):

from sys import version
from time import time
from math import sqrt, pi, e

print(version)

N = 1_000_000

def timeit1():
  z = N * e
  s = time()
  for n in range(N):
    z += (n * pi) ** .5 - z ** .5
  print (f"Took {(time() - s):.4f} seconds to calculate {z}")

def timeit2():
  z = N * e
  s = time()
  for n in range(N):
    z += sqrt(n * pi) - sqrt(z)
  print (f"Took {(time() - s):.4f} seconds to calculate {z}")

def timeit3(arg=sqrt):
  z = N * e
  s = time()
  for n in range(N):
    z += arg(n * pi) - arg(z)
  print (f"Took {(time() - s):.4f} seconds to calculate {z}")

timeit1()
timeit2()
timeit3()

Results vary, but a sample output is:

3.6.6 (default, Jul 19 2018, 14:25:17) 
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)]
Took 0.3747 seconds to calculate 3130485.5713865166
Took 0.2899 seconds to calculate 3130485.5713865166
Took 0.2635 seconds to calculate 3130485.5713865166

And a more recent output:

3.7.4 (default, Jul  9 2019, 16:48:28) 
[GCC 8.3.1 20190223 (Red Hat 8.3.1-2)]
Took 0.2583 seconds to calculate 3130485.5713865166
Took 0.1612 seconds to calculate 3130485.5713865166
Took 0.1563 seconds to calculate 3130485.5713865166

Try it yourself.

1
votes

Claudiu's results differ from mine. I'm using Python 2.6 on Ubuntu on an old P4 2.4Ghz machine... Here's my results:

>>> timeit1()
Took 0.564911 seconds
>>> timeit2()
Took 0.403087 seconds
>>> timeit1()
Took 0.604713 seconds
>>> timeit2()
Took 0.387749 seconds
>>> timeit1()
Took 0.587829 seconds
>>> timeit2()
Took 0.379381 seconds

sqrt is consistently faster for me... Even Codepad.org NOW seems to agree that sqrt, in the local context, is faster (http://codepad.org/6trzcM3j). Codepad seems to be running Python 2.5 presently. Perhaps they were using 2.4 or older when Claudiu first answered?

In fact, even using math.sqrt(i) in place of arg(i), I still get better times for sqrt. In this case timeit2() took between 0.53 and 0.55 seconds on my machine, which is still better than the 0.56-0.60 figures from timeit1.

I'd say, on modern Python, use math.sqrt and definitely bring it to local context, either with somevar=math.sqrt or with from math import sqrt.

1
votes

Of course, if one is dealing with literals and need a constant value, Python runtime can pre-calculate the value at compile time, if it is written with operators - no need to profile each version in this case:

In [77]: dis.dis(a)                                                                                                                       
  2           0 LOAD_CONST               1 (1.4142135623730951)
              2 RETURN_VALUE

In [78]: def a(): 
    ...:     return 2 ** 0.5 
    ...:                                                                                                                                  

In [79]: import dis                                                                                                                       

In [80]: dis.dis(a)                                                                                                                       
  2           0 LOAD_CONST               1 (1.4142135623730951)
              2 RETURN_VALUE

0
votes

The problem SQRMINSUM I've solved recently requires computing square root repeatedly on a large dataset. The oldest 2 submissions in my history, before I've made other optimizations, differ solely by replacing **0.5 with sqrt(), thus reducing the runtime from 3.74s to 0.51s in PyPy. This is almost twice the already massive 400% improvement that Claudiu measured.

-3
votes

What would be even faster is if you went into math.py and copied the function "sqrt" into your program. It takes time for your program to find math.py, then open it, find the function you are looking for, and then bring that back to your program. If that function is faster even with the "lookup" steps, then the function itself has to be awfully fast. Probably will cut your time in half. IN summary:

  1. Go to math.py
  2. Find the function "sqrt"
  3. Copy it
  4. Paste function into your program as the sqrt finder.
  5. Time it.