We have a question specifically regarding compressed input to an Amazon EMR Hadoop job.
According to AWS:
"Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you."
q.v., http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HowtoProcessGzippedFiles.html
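To make the AWS claim concrete, here is a minimal job sketch (not our actual job) that simply points a plain MapReduce job at a compressed object; TextInputFormat picks a codec by file extension, so nothing else is configured for decompression. The bucket, key names, and the identity mapper/reducer defaults are hypothetical placeholders:

```java
// Minimal sketch only: the input path is a compressed object, and no explicit
// decompression is configured. Bucket and key names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedInputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compressed-input-demo");
        job.setJarByClass(CompressedInputJob.class);
        // Default (identity) mapper/reducer are fine for the demo; the point is
        // that the input path below is a .bz2 file and needs no special handling.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("s3://my-bucket/input/data.bz2"));
        FileOutputFormat.setOutputPath(job, new Path("s3://my-bucket/output/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```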
That sounds good--however, looking into the BZip2 stream format, it appears that the "split" boundaries would be file-based:
.magic:16 = 'BZ' signature/magic number
.version:8 = 'h' for Bzip2 ('H'uffman coding), '0' for Bzip1 (deprecated)
.hundred_k_blocksize:8 = '1'..'9' block-size 100 kB-900 kB (uncompressed)
**-->.compressed_magic:48 = 0x314159265359 (BCD (pi))**
.crc:32 = checksum for this block
.randomised:1 = 0=>normal, 1=>randomised (deprecated)
.origPtr:24 = starting pointer into BWT for after untransform
.huffman_used_map:16 = bitmap, of ranges of 16 bytes, present/not present
.huffman_used_bitmaps:0..256 = bitmap, of symbols used, present/not present (multiples of 16)
.huffman_groups:3 = 2..6 number of different Huffman tables in use
.selectors_used:15 = number of times that the Huffman tables are swapped (each 50 bytes)
*.selector_list:1..6 = zero-terminated bit runs (0..62) of MTF'ed Huffman table (*selectors_used)
.start_huffman_length:5 = 0..20 starting bit length for Huffman deltas
*.delta_bit_length:1..40 = 0=>next symbol; 1=>alter length
.contents:2..8 = Huffman encoded data stream until end of block
**-->.eos_magic:48 = 0x177245385090 (BCD sqrt(pi))**
.crc:32 = checksum for whole stream
.padding:0..7 = align to whole byte
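For what it's worth, the two highlighted 48-bit magic values above can be located by scanning the compressed stream, since every block starts with 0x314159265359 and the stream ends with 0x177245385090. The toy scanner below only demonstrates that block boundaries are findable at arbitrary (bit) offsets inside a single .bz2 file; it is not how Hadoop actually assigns splits, and it ignores the small chance of the pattern occurring inside compressed payload:

```java
// Toy, read-only scanner: walk a .bz2 file bit by bit and report every
// occurrence of the 48-bit block magic 0x314159265359. Illustration only.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Bzip2BlockMagicScanner {
    private static final long BLOCK_MAGIC = 0x314159265359L;
    private static final long MASK_48 = 0xFFFFFFFFFFFFL;

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            long window = 0;   // last 48 bits seen
            long bitPos = 0;   // total bits consumed so far
            int b;
            while ((b = in.read()) != -1) {
                for (int i = 7; i >= 0; i--) {          // bits, most significant first
                    window = ((window << 1) | ((b >> i) & 1)) & MASK_48;
                    bitPos++;
                    if (bitPos >= 48 && window == BLOCK_MAGIC) {
                        System.out.printf("block magic at bit offset %d (byte ~%d)%n",
                                bitPos - 48, (bitPos - 48) / 8);
                    }
                }
            }
        }
    }
}
```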
Combine that with the statement: "Like gzip, bzip2 is only a data compressor. It is not an archiver like tar or ZIP; the program itself has no facilities for multiple files, encryption or archive-splitting, but, in the UNIX tradition, relies instead on separate external utilities such as tar and GnuPG for these tasks."
q.v., http://en.wikipedia.org/wiki/Bzip2
I interpret the combination of these two statements to mean that BZip2 is "splittable", but only on a per-file basis . . . .
This matters because our job will receive a single ~800 MiB file via S3--which (if my interpretation is correct) would mean that EMR/Hadoop assigns ONE mapper to the job (for ONE file), which would be sub-optimal, to say the least.
(That being the case, we would obviously need to find a way to partition the input into a set of ~400 files before BZip2 is applied, as a workaround.)
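If we do end up going that route, the pre-processing step could look roughly like the sketch below: split the raw, uncompressed input into ~400 line-oriented chunks and bzip2 each chunk, so EMR sees many small .bz2 files instead of one large one. The file names, the ~2 MiB chunk size (800 MiB / 400), and the use of Apache Commons Compress (BZip2CompressorOutputStream) are our own assumptions, not anything EMR requires:

```java
// Rough pre-processing sketch, run before the data reaches S3: split a raw text
// file into ~2 MiB parts on line boundaries and bzip2-compress each part.
import java.io.*;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class SplitAndBzip2 {
    public static void main(String[] args) throws IOException {
        long targetChunkBytes = 2L * 1024 * 1024;   // ~2 MiB of raw text per part
        int part = 0;
        long written = 0;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        Writer out = newPartWriter(part);
        String line;
        while ((line = in.readLine()) != null) {
            out.write(line);
            out.write('\n');
            written += line.length() + 1;
            if (written >= targetChunkBytes) {      // roll to the next part on a line boundary
                out.close();
                out = newPartWriter(++part);
                written = 0;
            }
        }
        out.close();
        in.close();
    }

    private static Writer newPartWriter(int part) throws IOException {
        String name = String.format("part-%04d.bz2", part);   // hypothetical naming scheme
        return new OutputStreamWriter(
                new BZip2CompressorOutputStream(new FileOutputStream(name)), "UTF-8");
    }
}
```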
Does anyone know for certain if this is how AWS/EMR Hadoop jobs internally function?
Cheers!