Why is my radix sort python implementation slower than quick sort?

Question

I rewrote the original radix sort algorithm for Python from Wikipedia using arrays from SciPy to gain performance and to reduce code length, which I managed to accomplish. Then I took the classic (in-memory, pivot based) quick sort algorithm from Literate Programming and compared their performance.

I had the expectation that radix sort will beat quick sort beyond a certain threshold, which it did not. Further, I found Erik Gorset's Blog's asking the question "Is radix sort faster than quick sort for integer arrays?". There the answer is that

.. the benchmark shows the MSB in-place radix sort to be consistently over 3 times faster than quicksort for large arrays.

Unfortunately, I could not reproduce the result; the differences are that (a) Erik chose Java and not Python and (b) he uses the MSB in-place radix sort, whereas I just fill buckets inside a Python dictionary.

According to theory radix sort should be faster (linear) compared to quick sort; but apparently it depends a lot on the implementation. So where is my mistake?

Here is the code comparing both algorithms:

from sys   import argv
from time  import clock

from pylab import array, vectorize
from pylab import absolute, log10, randint
from pylab import semilogy, grid, legend, title, show

###############################################################################
# radix sort
###############################################################################

def splitmerge0 (ls, digit): ## python (pure!)

    seq = map (lambda n: ((n // 10 ** digit) % 10, n), ls)
    buf = {0:[], 1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[]}

    return reduce (lambda acc, key: acc.extend(buf[key]) or acc,
        reduce (lambda _, (d,n): buf[d].append (n) or buf, seq, buf), [])

def splitmergeX (ls, digit): ## python & numpy

    seq = array (vectorize (lambda n: ((n // 10 ** digit) % 10, n)) (ls)).T
    buf = {0:[], 1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[]}

    return array (reduce (lambda acc, key: acc.extend(buf[key]) or acc,
        reduce (lambda _, (d,n): buf[d].append (n) or buf, seq, buf), []))

def radixsort (ls, fn = splitmergeX):

    return reduce (fn, xrange (int (log10 (absolute (ls).max ()) + 1)), ls)

###############################################################################
# quick sort
###############################################################################

def partition (ls, start, end, pivot_index):

    lower = start
    upper = end - 1

    pivot = ls[pivot_index]
    ls[pivot_index] = ls[end]

    while True:

        while lower <= upper and ls[lower] <  pivot: lower += 1
        while lower <= upper and ls[upper] >= pivot: upper -= 1
        if lower > upper: break

        ls[lower], ls[upper] = ls[upper], ls[lower]

    ls[end] = ls[lower]
    ls[lower] = pivot

    return lower

def qsort_range (ls, start, end):

    if end - start + 1 < 32:
        insertion_sort(ls, start, end)
    else:
        pivot_index = partition (ls, start, end, randint (start, end))
        qsort_range (ls, start, pivot_index - 1)
        qsort_range (ls, pivot_index + 1, end)

    return ls

def insertion_sort (ls, start, end):

    for idx in xrange (start, end + 1):
        el = ls[idx]
        for jdx in reversed (xrange(0, idx)):
            if ls[jdx] <= el:
                ls[jdx + 1] = el
                break
            ls[jdx + 1] = ls[jdx]
        else:
            ls[0] = el

    return ls

def quicksort (ls):

    return qsort_range (ls, 0, len (ls) - 1)

###############################################################################
if __name__ == "__main__":
###############################################################################

    lower = int (argv [1]) ## requires: >= 2
    upper = int (argv [2]) ## requires: >= 2
    color = dict (enumerate (3*['r','g','b','c','m','k']))

    rslbl = "radix sort"
    qslbl = "quick sort"

    for value in xrange (lower, upper):

        #######################################################################

        ls = randint (1, value, size=value)

        t0 = clock ()
        rs = radixsort (ls)
        dt = clock () - t0

        print "%06d -- t0:%0.6e, dt:%0.6e" % (value, t0, dt)
        semilogy (value, dt, '%s.' % color[int (log10 (value))], label=rslbl)

        #######################################################################

        ls = randint (1, value, size=value)

        t0 = clock ()
        rs = quicksort (ls)
        dt = clock () - t0

        print "%06d -- t0:%0.6e, dt:%0.6e" % (value, t0, dt)
        semilogy (value, dt, '%sx' % color[int (log10 (value))], label=qslbl)

    grid ()
    legend ((rslbl,qslbl), numpoints=3, shadow=True, prop={'size':'small'})
    title ('radix & quick sort: #(integer) vs duration [s]')
    show ()

###############################################################################
###############################################################################

And here is the result comparing sorting durations in seconds (logarithmic vertical axis) for integer arrays of size in range from 2 to 1250 (horizontal axis); lower curve belongs to quick sort:

Radix vs Quick Sort Comparison

Quick sort is smooth at the power changes (e.g. at 10, 100 or 1000), but radix sort just jumps a little but follows otherwise qualitatively the same path as quick sort, just much slower!

An array with 1250 elements isn't really a large array. What result do you get when you sort 1000000 elements? — sth
What happens when you throw a list of 1,000,000 or 10,000,000 values at this? It looks like you've got a horribly inefficient implementation of radix sort (things like calculating 10**digit in an inner loop and lots of unnecessary function calls) so the theoretical efficiency may not be visible when you've got a mere 1250 elements to sort. — Duncan
BTW, your code is very hard to read, but it looks to me as though you are depending on iterating over a dict gives you the keys in numeric order. That means you are depending on undefined behaviour so your code could break at any time. — Duncan
@Duncan: Well, I had precomputed the powers to digit and was using a lookup to increase performance, but that did not help; I could not detect any significant improvement. What do you mean with unnecessary function calls? The lambda expressions? — hsk81
@sth: Well the problem is that the implementation apparently is so slow that I can manage to sort lists of up to 10000, after that it is terribly slow. Even at 10000 quick sort was faster, although the relative distance between the algorithms seemed to have shrunk. — hsk81

Duncan Duncan · Accepted Answer · 2011-11-26T16:53:45

You have several problems here.

First of all, as pointed out in the comments, your data set is far too small for the theoretical complexity to overcome the overheads in the code.

Next your implementation with all those unnecessary function calls and copying lists around is very inefficient. Writing the code in a straightforward procedural manner will almost always be faster than a functional solution (for Python that is, other languages will differ here). You have a procedural implementation of quicksort so if you write your radix sort in the same style it may turn out faster even for small lists.

Finally, it may be that when you do try large lists the overheads of memory management begin to dominate. That means that you have a limited window between small lists where the efficiency of the implementation is the dominant factor and large lists where the memory management is the dominant factor.

Here's some code that uses your quicksort but a simple radixsort written procedurally but trying to avoid so much copying of data. You'll see that even for short lists it beats the quicksort but more interestingly as the data size goes up so does the ratio between quicksort and radix sort and then it begins to drop again as the memory management starts to dominate (simple things like freeing a list of 1,000,000 items take a significant time):

from random import randint
from math import log10
from time import clock
from itertools import chain

def splitmerge0 (ls, digit): ## python (pure!)

    seq = map (lambda n: ((n // 10 ** digit) % 10, n), ls)
    buf = {0:[], 1:[], 2:[], 3:[], 4:[], 5:[], 6:[], 7:[], 8:[], 9:[]}

    return reduce (lambda acc, key: acc.extend(buf[key]) or acc,
        reduce (lambda _, (d,n): buf[d].append (n) or buf, seq, buf), [])

def splitmerge1 (ls, digit): ## python (readable!)
    buf = [[] for i in range(10)]
    divisor = 10 ** digit
    for n in ls:
        buf[(n//divisor)%10].append(n)
    return chain(*buf)

def radixsort (ls, fn = splitmerge1):
    return list(reduce (fn, xrange (int (log10 (max(abs(val) for val in ls)) + 1)), ls))

###############################################################################
# quick sort
###############################################################################

def partition (ls, start, end, pivot_index):

    lower = start
    upper = end - 1

    pivot = ls[pivot_index]
    ls[pivot_index] = ls[end]

    while True:

        while lower <= upper and ls[lower] <  pivot: lower += 1
        while lower <= upper and ls[upper] >= pivot: upper -= 1
        if lower > upper: break

        ls[lower], ls[upper] = ls[upper], ls[lower]

    ls[end] = ls[lower]
    ls[lower] = pivot

    return lower

def qsort_range (ls, start, end):

    if end - start + 1 < 32:
        insertion_sort(ls, start, end)
    else:
        pivot_index = partition (ls, start, end, randint (start, end))
        qsort_range (ls, start, pivot_index - 1)
        qsort_range (ls, pivot_index + 1, end)

    return ls

def insertion_sort (ls, start, end):

    for idx in xrange (start, end + 1):
        el = ls[idx]
        for jdx in reversed (xrange(0, idx)):
            if ls[jdx] <= el:
                ls[jdx + 1] = el
                break
            ls[jdx + 1] = ls[jdx]
        else:
            ls[0] = el

    return ls

def quicksort (ls):

    return qsort_range (ls, 0, len (ls) - 1)

if __name__=='__main__':
    for value in 1000, 10000, 100000, 1000000, 10000000:
        ls = [randint (1, value) for _ in range(value)]
        ls2 = list(ls)
        last = -1
        start = clock()
        ls = radixsort(ls)
        end = clock()
        for i in ls:
            assert last <= i
            last = i
        print("rs %d: %0.2fs" % (value, end-start))
        tdiff = end-start
        start = clock()
        ls2 = quicksort(ls2)
        end = clock()
        last = -1
        for i in ls2:
            assert last <= i
            last = i
        print("qs %d: %0.2fs %0.2f%%" % (value, end-start, ((end-start)/tdiff*100)))

The output when I run this is:

C:\temp>c:\python27\python radixsort.py
rs 1000: 0.00s
qs 1000: 0.00s 212.98%
rs 10000: 0.02s
qs 10000: 0.05s 291.28%
rs 100000: 0.19s
qs 100000: 0.58s 311.98%
rs 1000000: 2.47s
qs 1000000: 7.07s 286.33%
rs 10000000: 31.74s
qs 10000000: 86.04s 271.08%

Edit: Just to clarify. The quicksort implementation here is very memory friendly, it sorts in-place so no matter how large the list it is just shuffling data around not copying it. The original radixsort effectively copies the list twice for each digit: once into the smaller lists and then again when you concatenate the lists. Using itertools.chain avoids that second copy but there's still a lot of memory allocation/deallocation going on. (Also 'twice' is approximate as list appending does involve extra copying even if it is amortized O(1) so I should maybe say 'proportional to twice'.)

Why is my radix sort python implementation slower than quick sort?

2 Answers