4 votes

The problem is: given an input file with four billion unique integers, provide an algorithm to generate an integer which is not contained in the file, assuming you have only 10 MB of memory.

I searched for solutions and posted the code below. One of them splits the 4-billion range into bit-vector blocks (each block covers a specific sub-range of values, and each bit in a block represents one integer) and keeps a separate counter per block that records how many input integers fall into that block. If a block's count is less than the block's capacity, at least one value in that range is missing, so the block's bit vector is built in a second pass and scanned to find a missing integer.

My question about this solution is: why is it that "the nearer to the middle that we pick, the less memory will be used at any given time"? Here is more context:

The array in the first pass can fit in 10 megabytes, or roughly 2^23 bytes, of memory. Since each element in the array is an int, and an int is 4 bytes, we can hold an array of at most about 2^21 elements. So, we can deduce the following:

2^31 / rangeSize <= 2^21 (the counter array must fit), so rangeSize >= 2^31 / 2^21 = 2^10
rangeSize <= 2^23 * 8 = 2^26 (the second-pass bit vector, one bit per value in a block, must also fit in 2^23 bytes)

Therefore, we can conclude the following: 2^10 <= rangeSize <= 2^26. These conditions give us a good amount of "wiggle room," but the nearer to the middle that we pick, the less memory will be used at any given time.
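To see the tradeoff concretely, here is a small sketch of my own (not from the book; it assumes the non-negative 2^31 value range used in the derivation above) that prints the first-pass counter memory and the second-pass bit-vector memory for several choices of rangeSize:

public class RangeSizeSweep {
    public static void main(String[] args) {
        long valueRange = 1L << 31; // non-negative int range assumed in the derivation above
        // Sweep rangeSize over powers of two between the two bounds, 2^10 and 2^26.
        for (int exp = 10; exp <= 26; exp += 4) {
            long rangeSize = 1L << exp;
            long counterBytes = (valueRange / rangeSize) * 4; // pass 1: one 4-byte counter per block
            long bitVectorBytes = rangeSize / 8;              // pass 2: one bit per value in one block
            System.out.printf("rangeSize=2^%-2d  counters=%,11d B  bitVector=%,11d B  total=%,11d B%n",
                    exp, counterBytes, bitVectorBytes, counterBytes + bitVectorBytes);
        }
    }
}

At the two ends of the allowed range one of the two structures alone takes about 8 MB, while around rangeSize = 2^18 (the midpoint on a log scale) the counters and the bit vector are each 32 KB, so the combined footprint is smallest there.
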

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

public class QuestionB {
    public static int bitsize = 1048576; // 2^20 bits (2^17 bytes)
    public static int blockNum = 4096; // 2^12
    public static byte[] bitfield = new byte[bitsize/8];
    public static int[] blocks = new int[blockNum];

    public static void findOpenNumber() throws FileNotFoundException {
        int starting = -1;
        Scanner in = new Scanner (new FileReader ("Chapter 10/Question10_3/input_file_q10_3.txt"));
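        /* First pass: count how many input integers fall into each
         * 2^20-value block. */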
        while (in.hasNextInt()) {
            int n = in.nextInt();
            blocks[n / (bitfield.length * 8)]++;
        }

        for (int i = 0; i < blocks.length; i++) {
            if (blocks[i] < bitfield.length * 8){
                /* if value < 2^20, then at least 1 number is missing in
                 * that section. */
                starting = i * bitfield.length * 8;
                break;
            }
        }

        in = new Scanner(new FileReader("Chapter 10/Question10_3/input_file_q10_3.txt"));
        while (in.hasNextInt()) {
            int n = in.nextInt();
            /* If the number is inside the block that’s missing 
             * numbers, we record it */
            if (n >= starting && n < starting + bitfield.length * 8) {
                bitfield [(n-starting) / 8] |= 1 << ((n - starting) % 8);
            }
        }

        for (int i = 0 ; i < bitfield.length; i++) {
            for (int j = 0; j < 8; j++) {
                /* Retrieves the individual bits of each byte. When 0 bit 
                 * is found, finds the corresponding value. */
                if ((bitfield[i] & (1 << j)) == 0) {
                    System.out.println(i * 8 + j + starting);
                    return;
                }
            }
        }       
    }

    public static void main(String[] args) throws FileNotFoundException {
        findOpenNumber();
    }

}
Is your question why the middle between 2¹⁰ and 2²⁶ takes less memory? – RealSkeptic
Does the file have four billion unique integers? Or, are duplicates a possibility? – Matthew McPeak
You forgot to clearly state your question. After reading your post, I can only think "that's interesting, but what are you asking?" – Durandal
Are the numbers unique? And are they consecutive, except for the one missing? – fps
Since the int range is from around -2B to +2B, and you say the set is distinct, negative numbers must exist, right? So this causes a negative array index in the first while loop. – hsnkhrmn

2 Answers

3 votes

If you form M blocks, each covering 2^32/M values, the total memory required is M + 2^27/M words of 32 bits (M counter words plus a bit vector of 2^32/M bits, which is 2^27/M words). This function reaches its minimum when M = √(2^27) = 2^13.5, which is halfway between 1 and 2^27 blocks on a logarithmic scale. The minimum is 2^14.5 words, about 92 KB.

This is very clear on a log-log plot.

[Plot: total memory M + 2^27/M in words versus the number of blocks M, on log-log axes, with the minimum at M = 2^13.5.]
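
As a quick check of that arithmetic, here is a small sketch of my own (not part of the answer) that evaluates M + 2^27/M for power-of-two block counts and reports where it is smallest:

public class BlockCountSweep {
    public static void main(String[] args) {
        long bestWords = Long.MAX_VALUE;
        int bestExp = -1;
        // Evaluate the total word count for M = 2^0 .. 2^27 blocks.
        for (int exp = 0; exp <= 27; exp++) {
            long m = 1L << exp;
            long words = m + (1L << 27) / m; // M counter words + 2^27/M bit-vector words
            if (words < bestWords) {
                bestWords = words;
                bestExp = exp;
            }
        }
        System.out.println("best M = 2^" + bestExp + ": " + bestWords + " words ("
                + bestWords * 4 + " bytes)");
    }
}

Restricted to powers of two, the minimum is 24,576 words (at M = 2^13 or 2^14), close to the continuous minimum of 2^14.5 ≈ 23,170 words, i.e. about 92 KB.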

1 vote

I like this question. I'll give it additional thought, but if disk space and time are not an issue, you can break the numbers into blocks of 100k values, write each block to its own file, and sort each file. Any block that doesn't have 100k entries has a gap. It's not elegant at all, but it gets the ball rolling.
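
Here is a rough sketch of that idea, written by me rather than the answerer (the input path and the buckets/ scratch directory are hypothetical, and it assumes non-negative integers as in the question's code): partition the input into bucket files of 100k consecutive values each; any bucket holding fewer than 100,000 values has a gap and can be examined on its own to find a missing integer.

import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.BitSet;
import java.util.Scanner;

public class BucketPartition {
    static final int BUCKET_SIZE = 100_000;     // values per bucket, as suggested in the answer
    static final String INPUT = "input.txt";    // hypothetical input path
    static final String BUCKET_DIR = "buckets"; // hypothetical scratch directory

    public static void main(String[] args) throws IOException {
        new File(BUCKET_DIR).mkdirs();
        int bucketCount = (int) ((1L << 31) / BUCKET_SIZE) + 1;
        int[] counts = new int[bucketCount];

        // Pass 1: append every value to its bucket file and count values per bucket.
        // (Opening the file for every value is slow, but the answer assumes time and
        // disk space are not an issue.)
        try (Scanner in = new Scanner(new FileReader(INPUT))) {
            while (in.hasNextInt()) {
                int n = in.nextInt();
                int b = n / BUCKET_SIZE;
                counts[b]++;
                try (PrintWriter out = new PrintWriter(new FileWriter(bucketFile(b), true))) {
                    out.println(n);
                }
            }
        }

        // Pass 2: a bucket with fewer than BUCKET_SIZE values must have a gap;
        // load only that bucket into a BitSet and report a missing value.
        for (int b = 0; b < bucketCount; b++) {
            if (counts[b] < BUCKET_SIZE) {
                BitSet seen = new BitSet(BUCKET_SIZE);
                File f = bucketFile(b);
                if (f.exists()) {
                    try (Scanner in = new Scanner(new FileReader(f))) {
                        while (in.hasNextInt()) {
                            seen.set(in.nextInt() - b * BUCKET_SIZE);
                        }
                    }
                }
                System.out.println((long) b * BUCKET_SIZE + seen.nextClearBit(0));
                return;
            }
        }
    }

    private static File bucketFile(int b) {
        return new File(BUCKET_DIR, "bucket_" + b + ".txt");
    }
}

Sorting each bucket file, as the answer suggests, works just as well; counting plus a per-bucket bit vector is used here only to keep the sketch short.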