I would like to use Hadoop Map/Reduce to process delimited Protocol Buffer files that are compressed using something other than LZO, e.g. xz or gzip. Twitter's elephant-bird library seems to mainly support reading protobuf files that are LZO compressed and thus doesn't seem to meet my needs. Is there an existing library or a standard approach to doing this?
(NOTE: As you can see by my choice of compression algorithms, it's not necessary for the solution to make the protobuf files splittable. Your answer doesn't even need to specify a particular compression algorithm, but should allow for at least one of the ones I mentioned.)