While helping a student with his classes, I implemented the dual-pivot quicksort algorithm to prepare a session and got intrigued. After running some statistics, solving the worst-case situation, running the stats again, solving the next worst case, and repeating this process several times, the resulting code is no more than 80 lines of simple, straightforward Python (a bit less than Vladimir's code). The novel part is how the 3 partitions are constructed, in combination with some very simple yet effective post-processing of them. Now I need some help on how to test and gather statistics properly.
Especially about how to count the swaps: most of the swaps perform only two assignments instead of three. So must I count them as full swaps, or is it fair to count them as only 2/3 of a swap?
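To make that concrete, this is the kind of bookkeeping I have in mind; a minimal sketch, not my actual implementation, assuming the lighter swap holds one value in a local variable:

    swap_stats = {"assignments": 0}

    def full_swap(a, i, j):
        # classic swap: equivalent to three assignments via a temporary
        a[i], a[j] = a[j], a[i]
        swap_stats["assignments"] += 3

    def light_swap(a, i, j, held):
        # 'lighter' swap during partitioning: one value is already held
        # in a local variable, so only two list assignments are needed
        a[i] = a[j]
        a[j] = held
        swap_stats["assignments"] += 2

    def swaps_in_full_units():
        # report swaps in units of a full swap, so a light swap counts 2/3
        return swap_stats["assignments"] / 3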
Counting every swap as 1, the Cn in Cn * N * log2(N) is around 0.48 on short lists (< 100 elements) and around 0.55 on longer lists of several million elements. That is just the theoretical minimum as calculated by Vladimir Yaroslavskiy. Counting the lighter swaps as 2/3 instead, the number of needed swaps is almost equal for any list size and is around 0.36 (stdev around 0.015).
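For reference, the constant is simply the measured count divided by N * log2(N); a small helper makes that explicit:

    import math

    def normalized_constant(count, n):
        # solve count == Cn * n * log2(n) for Cn
        return count / (n * math.log2(n))

    # e.g. Cn_swaps = normalized_constant(swaps_in_full_units(), len(a))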
The Cn for the number of comparisons is on average around 1.3 for lists of 2 million records, which is less than the theoretical 1.38 (from 2 * N * ln(N)), and lower for shorter lists: for 1024 elements it's around 1.21. That is for lists with 100% unique numbers, randomly ordered with Python's random.shuffle().
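The 1.38 comes from rewriting the natural log in base 2: 2 * N * ln(N) = 2 * ln(2) * N * log2(N), and 2 * ln(2) ≈ 1.386. A quick check:

    import math
    print(2 * math.log(2))  # 1.3862943611198906, the theoretical Cn for comparisons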
So my question is: is it OK to count the lighter swaps as such, and is the result indeed promising or not?
Also interesting: the more equal elements in the list, the faster it sorts. Cn is 0.03 and 0.1 for swaps and comparisons respectively for a 2-million-element list of all equal elements. The Cn for sorted and reverse-sorted lists is almost the same for all sizes: 0.3 and 1 for the swaps (counted with 2/3) and comparisons respectively.
I will shortly post a list with more statistics, which includes maximum stack depth and the number of recursive calls, besides the swaps and comparisons. Are there other things I should count?
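For the maximum stack depth and call count, wrapping the recursive entry point is enough; a minimal sketch (the decorator is illustrative, not part of the 80 lines):

    import functools

    depth_stats = {"calls": 0, "depth": 0, "max_depth": 0}

    def track_depth(func):
        # count every recursive call and record the deepest nesting level
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            depth_stats["calls"] += 1
            depth_stats["depth"] += 1
            depth_stats["max_depth"] = max(depth_stats["max_depth"], depth_stats["depth"])
            try:
                return func(*args, **kwargs)
            finally:
                depth_stats["depth"] -= 1
        return wrapper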
Also, are there 'standard' test suites with files covering all kinds of situations (with equal elements, partially sorted, etc.) that one can use to test a sorting algorithm and make the results comparable with other sorting algorithms?
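In the meantime I generate the input patterns myself; something along these lines (the cases with duplicates and partial order are extra patterns I would add, beyond those in the tables below):

    import random

    def make_inputs(n, seed=42):
        # input patterns for the statistics below, plus a few cases
        # with duplicates and partial order
        rng = random.Random(seed)
        shuffled = list(range(n))
        rng.shuffle(shuffled)
        nearly_sorted = list(range(n))
        for _ in range(n // 100):  # disturb roughly 1% of positions
            i, j = rng.randrange(n), rng.randrange(n)
            nearly_sorted[i], nearly_sorted[j] = nearly_sorted[j], nearly_sorted[i]
        return {
            "shuffled unique": shuffled,
            "sorted": list(range(n)),
            "reversed": list(range(n - 1, -1, -1)),
            "all equal": [0] * n,
            "few unique": [rng.randrange(10) for _ in range(n)],
            "nearly sorted": nearly_sorted,
        }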
Added May 5: I improved the algorithm, especially for sorted lists. Here are the results for 20 runs of each. Are these good results?
New statistics:
Random.shuffle(), unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.367 0.922 0.250
64 0.360 1.072 0.500
256 0.342 1.122 0.625
1024 0.358 1.156 0.800
4096 0.359 1.199 0.917
16384 0.359 1.244 1.071
65536 0.360 1.244 1.125
262144 0.360 1.269 1.167
1048576 0.362 1.275 1.200
Sorted, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.172 0.531 0.250
64 0.117 0.586 0.333
256 0.087 0.609 0.375
1024 0.075 0.740 0.500
4096 0.060 0.732 0.500
16384 0.051 0.726 0.500
65536 0.044 0.722 0.500
262144 0.041 0.781 0.556
1048576 0.036 0.774 0.550
2097152 0.035 0.780 0.571
Reversed order, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.344 0.828 0.250
64 0.279 0.812 0.333
256 0.234 0.788 0.375
1024 0.210 0.858 0.500
4096 0.190 0.865 0.500
16384 0.172 0.855 0.500
65536 0.158 0.846 0.500
262144 0.153 0.900 0.556
1048576 0.143 0.892 0.550
2097152 0.140 0.895 0.571