My solution: best case 7.025 bits/number, worst case 14.193 bits/number, rough average 8.551 bits/number. Stream-encoded, no random access.
Even before reading ruslik’s answer, I immediately thought of encoding the difference between each number, since it will be small and should be relatively consistent, but the solution must also be able to accommodate the worst case scenario. We have a space of 100000 numbers that contain only 1000 numbers. In a perfectly uniform phone book, each number would be greater than the previous number by 100:
55555-12345
55555-12445
55555-12545
If that was the case, it would require zero storage to encode the differences between numbers, since it’s a known constant. Unfortunately, numbers may vary from the ideal steps of 100. I would encode the difference from the ideal increment of 100, so that if two adjacent numbers differ by 103, I would encode the number 3 and if two adjacent numbers differ by 92, I would encode -8. I call the delta from an ideal increment of 100 the “variance”.
The variance can range from -99 (i.e. two consecutive numbers) to 99000 (the entire phonebook consists of numbers 00000…00999 and an additional furthest-away number 99999), which is a range of 99100 possible values.
I’d aim to allocate a minimal storage to encode the most common differences and expand the storage if I encounter bigger differences (like ProtoBuf’s varint
). I’ll use chunks of seven bits, six for storage and an additional flag bit at the end to indicate that this variance is stored with an additional chunk after the current one, up to a maximum of three chunks (which will provide a maximum of 3 * 6 = 18 bits of storage, which are 262144 possible value, more than the number of possible variances (99100). Each additional chunk that follows a raised flag has bits of a higher significance, so the first chunk always has bits 0-5, the optional second chunks has bits 6-11, and the optional third chunk has bits 12-17.
A single chunk provides six bits of storage which can accommodate 64 values. I’d like to map the 64 smallest variances to fit in that single chunk (i.e. variances of -32 to +31) so I’ll use ProtoBuf ZigZag encoding, up to the variances of -99 to +98 (since there’s no need for a negative variance beyond -99), at which point I’ll switch to regular encoding, offset by 98:
Variance | Encoded Value
-----------+----------------
0 | 0
-1 | 1
1 | 2
-2 | 3
2 | 4
-3 | 5
3 | 6
... | ...
-31 | 61
31 | 62
-32 | 63
-----------|--------------- 6 bits
32 | 64
-33 | 65
33 | 66
... | ...
-98 | 195
98 | 196
-99 | 197
-----------|--------------- End of ZigZag
100 | 198
101 | 199
... | ...
3996 | 4094
3997 | 4095
-----------|--------------- 12 bits
3998 | 4096
3999 | 4097
... | ...
262045 | 262143
-----------|--------------- 18 bits
Some examples of how variances would be encoded as bits, including the flag to indicate an additional chunk:
Variance | Encoded Bits
-----------+----------------
0 | 000000 0
5 | 001010 0
-8 | 001111 0
-32 | 111111 0
32 | 000000 1 000001 0
-99 | 000101 1 000011 0
177 | 010011 1 000100 0
14444 | 001110 1 100011 1 000011 0
So the first three numbers of a sample phone book would be encoded as a stream of bits as follows:
BIN 000101001011001000100110010000011001 000110 1 010110 1 00001 0
PH# 55555-12345 55555-12448 55555-12491
POS 1 2 3
Best case scenario, the phone book is somewhat uniformly distributed and there are no two phone numbers that have a variance greater than 32, so it would use 7 bits per number plus 32 bits for the starting number for a total of 32 + 7*999 = 7025 bits.
A mixed scenario, where 800 phone numbers' variance fits within one chunk (800 * 7 = 5600), 180 numbers fit in two chunks each (180 * 2 * 7 = 2520) and 19 numbers fit in three chunks each (20 * 3 * 7 = 399), plus the initial 32 bits, totals 8551 bits.
Worst case scenario, 25 numbers fit in three chunks (25 * 3 * 7 = 525 bits) and the remaining 974 numbers fit in two chunks (974 * 2 * 7 = 13636 bits), plus 32 bits for the first number for a grand total of 14193 bits.
Amount of encoded numbers |
1-chunk | 2-chunks | 3-chunks | Total bits
---------+----------+----------+------------
999 | 0 | 0 | 7025
800 | 180 | 19 | 8551
0 | 974 | 25 | 14193
I can see four additional optimizations that can be performed to further reduce the space required:
- The third chunk doesn’t need the full seven bits, it can be just five bits and without a flag bit.
- There can be an initial pass of the numbers to calculate the best sizes for each chunk. Maybe for a certain phonebook, it would be optimal to have the first chunk have 5+1 bits, the second 7+1 and the third 5+1. That would further reduce the size to a minimum of 6*999 + 32 = 6026 bits, plus two sets of three bits to store the sizes of chunks 1 and 2 (chunk 3’s size is the remainder of the required 16 bits) for a total of 6032 bits!
- The same initial pass can calculate a better expected increment than the default 100. Maybe there's a phone book that starts from 55555-50000, and so it has half the number range so the expected increment should be 50. Or maybe there's a non-linear distribution (standard deviation perhaps) and some other optimal expected increment can be used. This would reduce the typical variance and might allow an even smaller first chunk to be used.
- Further analysis can be done in the first pass to allow the phone book to be partitioned, with each partition having its own expected increment and chunk size optimizations. This would allow for a smaller first chunk size for certain highly uniform parts of the phone book (reducing the number of bits consumed) and larger chunks sizes for non-uniform parts (reducing the number of bits wasted on continuation flags).