The Pigeonhole Principle -- if you put N pigeons into M pigeonholes and N > M, then some hole must contain at least 2 pigeons. The 2^32 possible 32-bit integers are our pigeonholes, and the 4.3 billion numbers in our file are the pigeons. Since 4,300,000,000 > 2^32 = 4,294,967,296, the file must contain at least one duplicate.
You can apply this principle to test whether a duplicate lies in a chosen subrange of the values, at the cost of one full pass over the file but without ever holding more than a few bytes in RAM -- just count how many numbers from the file fall in your test range, and compare that count to the number of distinct values the range can hold. For example, to check for a duplicate between 1,000,000 and 2,000,000 inclusive:
#include <stdio.h>   // fread
#include <stdint.h>  // uint32_t, uint64_t

// 'file' is an open FILE* over the raw 32-bit integers
uint64_t pigeons = 0; // 64-bit: in the worst case all 4.3B values land in the range
uint64_t pigeonholes = 2000000 - 1000000 + 1; // include both fenceposts
uint32_t n;
while (fread(&n, sizeof n, 1, file) == 1) {
    if (n >= 1000000 && n <= 2000000) {
        pigeons++;
    }
}
if (pigeons > pigeonholes) {
    // one of the duplicates is between 1,000,000 and 2,000,000
    // try again with a narrower range
}
Picking how wide a range to test per pass, versus how many passes over 16GB of data you're willing to make, is up to you :) One caveat: the test is one-directional -- pigeons > pigeonholes proves a duplicate is in the range, but pigeons <= pigeonholes doesn't prove the range is duplicate-free, so you hunt by narrowing in on a range that is provably over capacity.
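If you halve the range on each pass, this becomes a binary search over the value range that pins down a concrete duplicated value in at most 32 passes. Here's a minimal C sketch of that idea; the file name "numbers.bin", the raw native-endian uint32 file layout, and the find_duplicate name are my assumptions, not part of the original problem:

#include <stdio.h>
#include <stdint.h>

// Binary search over the value range [0, 2^32 - 1]: each pass counts how
// many values fall in the lower half of the current range, and we descend
// into whichever half holds more pigeons than pigeonholes. Because the file
// holds more than 2^32 values, one half is always over capacity, so this
// converges in at most 32 passes.
uint32_t find_duplicate(const char *path) {
    uint64_t lo = 0, hi = UINT32_MAX; // 64-bit to avoid overflow in the math below
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        uint64_t below = 0;
        FILE *f = fopen(path, "rb");
        if (!f) return 0; // real code should report the error
        uint32_t n;
        while (fread(&n, sizeof n, 1, f) == 1) {
            if (n >= lo && n <= mid) {
                below++;
            }
        }
        fclose(f);
        if (below > mid - lo + 1) {
            hi = mid;     // lower half is over capacity
        } else {
            lo = mid + 1; // so the upper half must be
        }
    }
    return (uint32_t)lo;  // the surviving one-value range is a duplicate
}

int main(void) {
    // "numbers.bin" is a stand-in for wherever the 4.3 billion ints live
    printf("a duplicated value: %u\n", find_duplicate("numbers.bin"));
    return 0;
}

Each pass re-reads the whole file, so total I/O is at most 32 x 16GB; splitting each pass into more than two buckets (say, 256 counters instead of 2) cuts the number of passes way down at the cost of a little more RAM.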
As far as a general algorithm category goes, this is a combinatorics (the math of counting) problem.