8
votes

It's in the section 2.6 and problem 2, the original problem is like this:

"Given a sequential file containing 4,300,000,000 32-bit integers, how can you find one that appears at least twice?"

My question toward this exercise is that: what is the tricks of the above problem and what kind of general algorithm category this problem is in?

4
the solution given in the book is binary searchDavid Heffernan

4 Answers

10
votes

Create a bit array of length 2^32 bits (initialize to zero), that would be about 512MB and will fit into RAM on any modern machine.

Start reading the file, int by int, check bit with the same index as the value of the int, if the bit is set you have found a duplicate, if it is zero, set to one and proceed with the next int from the file.

The trick is to find a suitable data structure and algorithm. In this case everything fits into RAM with a suitable data structure and a simple and efficient algorithm can be used.
If the numbers are int64 you need to find a suitable sorting strategy or make multiple passes, depending on how much additional storage you have available.

7
votes

The Pigeonhole Principle -- If you have N pigeons in M pigeonholes, and N>M, there are at least 2 pigeons in a hole. The set of 32-bit integers are our 2^32 pigeonholes, the 4.3 billion numbers in our file are the pigeons. Since 4.3x10^9 > 2^32, we know there are duplicates.

You can apply this principle to test if a duplicate we're looking for is in a subset of the numbers at the cost of reading the whole file, without loading more than a little at a time into RAM-- just count the number of times you see a number in your test range, and compare to the total number of integers in that range. For example, to check for a duplicate between 1,000,000 and 2,000,000 inclusive:

int pigeons = 0;
int pigeonholes = 2000000 - 1000000 + 1; // include both fenceposts
for (each number N in file) {
  if ( N >= 1000000 && N <= 2000000 ) {
    pigeons++
  }
}
if (pigeons > pigeonholes) {
  // one of the duplicates is between 1,000,000 and 2,000,000
  // try again with a narrower range
} 

Picking how big of range(s) to check vs. how many times you want to read 16GB of data is up to you :)

As far as a general algorithm category goes, this is a combinatorics (math about counting) problem.

-1
votes

If what do you mean is 32 bit positive integers, I think this problem doesn't require some special algorithm or trick to solve. Just a simple observation will lead to the intended solution.

My observation goes like this, the sequential file will contain only 32 bit integers (which is from 0 to 2 ^ 31 - 1). Assume you put all of them in that file uniquely, you will end up with 2 ^ 31 lines. You can see that if you put those positive integers once again, you will end up with 2 ^ 31 * 2 lines and it is smaller than 4,300,000,000.

Thus, the answer is the whole positive integers ranging from 0 to 2 ^ 31 - 1.

-1
votes

Sort the integers and loop through them to see if consecutive integers are duplicates. If you want to do this in memory, it requires 16GB memory that is possible with todays machines. If this is not possible, you could sort the numbers using mergesort and by store intermediate arrays to disk.

My first implementation attempt would be to use sort and uniq commands from unix.