5 votes

I would like to share my analysis of bucket sort below.
There are many ways bucket sort can be implemented; some of them are as follows.
Type 1:
If we know the range of the elements to be sorted, we can set up buckets for each possible element, and just toss elements into their corresponding buckets. We then empty the buckets in order, and the result is a sorted list. In implementing this algorithm, we can easily use an array to represent our buckets, where the value at each array index represents the number of elements in the corresponding bucket. If we have integers in the range [0..max], then we set up an array of (max + 1) integers and initialize all the values to zero. We then proceed sequentially through the unsorted array, reading the value of each element, going to the corresponding index in the buckets array, and incrementing the value there.

Time : O(N)
Space: O(1)
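A minimal Python sketch of this counting approach (the function name and example values are just illustrations; it assumes non-negative integers with a known maximum):

    def counting_sort_simple(arr, max_value):
        # Count occurrences of each value; assumes integers in [0, max_value].
        counts = [0] * (max_value + 1)      # one counter ("bucket") per possible value
        for x in arr:
            counts[x] += 1                  # "toss" x into its bucket by counting it
        result = []
        for value, count in enumerate(counts):
            result.extend([value] * count)  # empty the buckets in order
        return result

    print(counting_sort_simple([3, 1, 4, 1, 5, 9, 2, 6], max_value=9))
    # [1, 1, 2, 3, 4, 5, 6, 9]
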
Type 2:

Example: sorting an array of people by age.
Age is somewhat different from arbitrary integers for sorting because it has a small range [0-150] (all people's ages are between 0 and 150). So the quickest way to sort it would be to allocate 151 linked lists (let's call them buckets) and put each person's data structure in the bucket according to his/her age:

Time : O(N+K)
Space: O(N+K)
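A possible Python sketch of this, assuming (purely for illustration) that each person is a dict with an 'age' key:

    def sort_people_by_age(people):
        # 151 buckets, one per possible age in [0, 150]; stable, O(n + k).
        buckets = [[] for _ in range(151)]
        for person in people:
            buckets[person["age"]].append(person)
        result = []
        for bucket in buckets:
            result.extend(bucket)        # emit buckets in ascending age order
        return result

    people = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 7}, {"name": "Eve", "age": 36}]
    print([p["name"] for p in sort_people_by_age(people)])   # ['Bob', 'Ada', 'Eve']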

Type 3 (variation of Type 2, as shown on Wikipedia)

The function nextSort is a sorting function used to sort each bucket. If insertion sort is used, the worst case will be O(n^2); if merge sort is used instead (so that stability is kept), it will be O(n log n).
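A rough Python sketch in the spirit of the Wikipedia pseudocode; next_sort plays the role of nextSort (Python's built-in stable sort stands in here), and the value-to-bucket mapping is just one common choice:

    def bucket_sort(arr, bucket_count, next_sort=sorted):
        # Assumes non-negative numbers; next_sort is used to sort each bucket.
        if not arr:
            return []
        max_value = max(arr)
        buckets = [[] for _ in range(bucket_count)]
        for x in arr:
            # Map the value range [0, max_value] onto bucket indices [0, bucket_count - 1]
            buckets[x * bucket_count // (max_value + 1)].append(x)
        result = []
        for bucket in buckets:
            result.extend(next_sort(bucket))   # nextSort applied per bucket
        return result

    print(bucket_sort([29, 25, 3, 49, 9, 37, 21, 43], bucket_count=5))
    # [3, 9, 21, 25, 29, 37, 43, 49]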

  • Questions:
    1> How is it considered a linear sort? Is it because of Type 1 or Type 2?
    2> If I use Type 3, as on Wikipedia, which sort is efficient for sorting each bucket?
    I know the reason insertion sort is used in practice is that we expect the buckets to be small, and for small lists insertion sort is much faster than anything else. Even when implementing merge sort or quicksort, insertion sort is used once the list gets small enough (say, below 20 items or so); see the sketch after this list.
    3> For Type 3, on what basis can I decide the range (number) of buckets?
    This is important because if you try doing a bucket sort with a huge number of buckets, for example much greater than n, the runtime might be dominated by the time required to scan over all the buckets looking for the ones you actually used, even if most of them are empty.
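Regarding question 2, here is a small Python sketch of that hybrid idea (the cutoff of 20 and all names are just illustrative assumptions): a stable merge sort that falls back to insertion sort for small ranges.

    CUTOFF = 20  # below this size, insertion sort is typically faster

    def insertion_sort(arr, lo, hi):
        # Sort arr[lo:hi] in place with insertion sort (stable).
        for i in range(lo + 1, hi):
            x = arr[i]
            j = i - 1
            while j >= lo and arr[j] > x:
                arr[j + 1] = arr[j]
                j -= 1
            arr[j + 1] = x

    def merge_sort(arr, lo=0, hi=None):
        # Stable merge sort that switches to insertion sort on small ranges.
        if hi is None:
            hi = len(arr)
        if hi - lo <= CUTOFF:
            insertion_sort(arr, lo, hi)
            return
        mid = (lo + hi) // 2
        merge_sort(arr, lo, mid)
        merge_sort(arr, mid, hi)
        # Merge the two sorted halves
        merged = []
        i, j = lo, mid
        while i < mid and j < hi:
            if arr[j] < arr[i]:       # strict '<' keeps the merge stable
                merged.append(arr[j]); j += 1
            else:
                merged.append(arr[i]); i += 1
        merged.extend(arr[i:mid])
        merged.extend(arr[j:hi])
        arr[lo:hi] = merged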

I based my analysis on:
Wikipedia
How could the complexity of bucket sort is O(n+k)?
Design and Analysis of Algorithms Lecture notes for January 23, 1996
http://www1bpt.bridgeport.edu/~dichter/lilly/bucketsort.htm
http://cs.nyu.edu/courses/fall02/V22.0310-002/lectures/lecture-23.html
How is the complexity of bucket sort is O(n+k) if we implement buckets using linked lists?
What is the worst case complexity for bucket sort?

3 Answers

5 votes

Type 1:
The first type you describe is not really bucket sort; it's actually counting sort, also known as key-indexed counting, although it is considered a variant of bucket sort. The reason is that you're only counting the occurrences of each key rather than storing the keys themselves in the buckets.

Ref: http://en.wikipedia.org/wiki/Counting_sort
Ref: http://www.cs.princeton.edu/courses/archive/spr13/cos226/demo/51DemoKeyIndexedCounting.pdf

Space: O(1)
we can set up buckets for each possible element,

Isn't this contradictory? You're going to declare buckets for every possible element and still keep O(1)? ;)

If you want the algorithm to be stable, you cannot overwrite the input array either. So in practice you need a space requirement of n + k for:

  • An output array of length 'n' (the same size as your input array, basically)
  • 'k' buckets

If you check the pseudocode for counting sort, you will notice the last loop goes over the input array again to see where every element needs to go. By doing this in the order the elements appear in the input array, you get a stable sort.
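For reference, a small Python sketch of that stable version (variable names are my own), showing the k counters, the output array of length n, and the final pass over the input in its original order:

    def counting_sort_stable(arr, k):
        # Stable counting sort for integer keys in [0, k - 1]; O(n + k) extra space.
        counts = [0] * k
        for key in arr:
            counts[key] += 1
        # Turn the counts into starting positions (prefix sums)
        total = 0
        for key in range(k):
            counts[key], total = total, total + counts[key]
        output = [None] * len(arr)
        # Last loop: go over the input again, in its original order, for stability
        for key in arr:
            output[counts[key]] = key
            counts[key] += 1
        return output

    print(counting_sort_stable([4, 1, 3, 1, 0], k=5))   # [0, 1, 1, 3, 4]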

PS: Keep in mind that you're not necessarily sorting integers. If the input is an array of characters between A-Z you can also use this algorithm.

Type 2:

So the quickest way to sort it would be to allocate 151 linked lists (let's call them buckets) and put each person's data structure in the bucket according to his/her age:

It might be the easiest way because you can find the needed bucket rather easily, but it's not necessarily the quickest way ;). Another possibility, for example, is to create buckets for every 10 years:

00 - 09
10 - 19
20 - 29
...

And when you want to insert something into a bucket, you can do:

  • Binary search on the bucket (LinkedList for example) to find the right position
  • Insert the element

This way you also don't need to sort the buckets afterwards because everything is already sorted. I'm not saying it is a good idea, just pointing out the possibility ;) (a small sketch follows below).
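A small Python sketch of that idea, with one bucket per decade and a binary-search insert (bisect.insort) so every bucket stays sorted; plain integer ages are assumed for simplicity:

    import bisect

    def sort_ages_decade_buckets(ages):
        # One bucket per decade: 0-9, 10-19, ..., 150-159.
        buckets = [[] for _ in range(16)]
        for age in ages:
            bisect.insort(buckets[age // 10], age)   # binary search + insert
        result = []
        for bucket in buckets:
            result.extend(bucket)                    # already sorted, just concatenate
        return result

    print(sort_ages_decade_buckets([73, 5, 21, 68, 12, 29]))   # [5, 12, 21, 29, 68, 73]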

Questions:
1) Simply put: it's a linear sort because it takes linear time to sort. Both Type 1 and Type 2 take O(n + k). Because bucket sort doesn't use comparisons between elements like quicksort, bubble sort, ..., it is not bound by the O(n log n) lower bound. Keep in mind that O-notation doesn't give any guarantee about speed; it gives a guarantee about growth rate. If your input size doubles from 'N' to '2N', your linear-time algorithm will cope with it better than, for example, an O(n^2) algorithm like bubble sort. ;)

2) Insertion sort is indeed efficient for small arrays, and that's the main reason it's chosen, plus the fact that it's stable: if you don't use a stable algorithm to sort the buckets themselves, the whole algorithm (bucket sort) won't be stable.

3) Hard to say; it depends on the data, in my opinion. If you have to sort 1 million 32-bit integers, you're not going to create 2^32 buckets for them. In that case it might be nice to take a look at other algorithms (for example LSD radix sort), which would basically create 10 buckets (one for every decimal digit, 0-9).
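A rough Python sketch of LSD radix sort for non-negative integers, bucketing by one decimal digit (10 buckets) per pass; names are illustrative:

    def lsd_radix_sort(arr):
        # Bucket by each decimal digit, least significant first; stable per pass.
        if not arr:
            return []
        result = list(arr)
        digit_place = 1
        while digit_place <= max(arr):
            buckets = [[] for _ in range(10)]              # one bucket per digit 0-9
            for x in result:
                buckets[(x // digit_place) % 10].append(x)
            result = [x for bucket in buckets for x in bucket]   # stable concatenation
            digit_place *= 10
        return result

    print(lsd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
    # [2, 24, 45, 66, 75, 90, 170, 802]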

1 vote

Bucket sorting is linear-time when each bucket is sorted in linear time. "Type 1" and "Type 2" are both linear-time because all of the values in each bucket compare pairwise equal and need no further sorting.

The answer to your latter two questions is whatever works in practice. Usually the writer of your standard-library sort has determined the appropriate cutoff for insertion sort. I imagine that the performance of bucket sort depends heavily on the data and the memory subsystem in question.

0 votes

Type 1 and Type 2 as you described them are essentially the same, in the sense that you have a known range. Yes, in that case it's linear time complexity, since no further sorting is required within each bucket: each bucket contains only a single value.