3
votes

From the book Programming Pearls 2nd Edition, quoting the question A from Column 2, section 2.1

Given a sequential file that contains at most four billion integers in random order, find a 32-bit integer that isn't in the file (and there must be at least one missing - why ?). How would you solve this problem with ample quantities of main memory ? How would you solve it if you could use several external "scratch" files but only a few hundred bytes of main memory ?

The problem statement says "at most" four billion integers. So one valid input could be in the range 100 - 299 with one missing number. If this understanding of the problem statement is correct then the required inputs for this problem are the file containing the numbers and also the range of the numbers in the file, ie: i to n.

For this problem isn't the following O(n) solution more intuitive than the one given in the book (from Ed Reingold) ? Or am I missing something ?

Assume the given range is i...n

using the forumla (n * (n + 1) / 2) 
  x = the sum of numbers from 1 to i-1 
  y = the sum of numbers from 1 to n 
walk through the input and get a sum of the numbers (value z)

missing number = (y - x - z)
1
You are missing something: (1) The numbers don't said to be unique, the summation approach assumes unique elements and only one missing in the range (2) You don't know the range of the numbers, in 32 bit integer can have much more then 4B elements.amit
You are making a lot of assumptions, which going by the question statement seem wrongasheeshr
The problem statement doesn't mean that there is only one missing number (or that there are no duplicates). Your algorithm is solving a different problem where there is exactly one missing number from a range.Michael Burr
There are slightly more than 4 billion integers that would fit in 32 bits. So we have an upper range on the number of possible values. With plenty of memory, you could use a bitmap to track the numbers you have found. With just a small amount of memory, you will have to work harder. :-)Bo Persson

1 Answers

4
votes

You are missing something:

  1. The numbers don't said to be unique, the summation approach assumes unique elements and only one missing in the range
  2. You don't know the range of the numbers, in 32 bit integer can have much more then 4B elements (to be exact, there are 294967296 more numbers that can be represented by 32 bits then 4B)

Look at the simplified counter example with range = [1,5], array = [5,5,5,4,1].
In this case, you will get x=1, y = 15, z = 20.
However, 20-15-1 = 4, and it is not missing.

You can use radix sort, which runs in O(n) in this case (because 32 bits is constant), and then scan the sorted array to find the first missing element.

EDIT: A more efficient way to do it is yet with a variation of radix sort and selection algorithm:

  1. Let your current number of bits be k. Look at the first bit, and partition the array into two parts, the first with this bit unset, and the second with this bit set.
  2. In at least one of the parts - there must be less then (2^k) / 2 = 2^(k-1) elements.
  3. Reduce the problem to this sub array only and use k' = k-1 (reduce the current number of bits) and return to 1.

Keep doing it until you exhausted your bits, and you will get a number that is not in the original list.

Note that assuming the list is random enough, the complexity of the algorithm is O(n) - for any number of bits (and not O(n*k))