
I would like to use Hadoop Map/Reduce to process delimited Protocol Buffer files that are compressed using something other than LZO, e.g. xz or gzip. Twitter's elephant-bird library seems to mainly support reading protobuf files that are LZO compressed and thus doesn't seem to meet my needs. Is there an existing library or a standard approach to doing this?

(NOTE: As you can see by my choice of compression algorithms, it's not necessary for the solution to make the protobuf files splittable. Your answer doesn't even need to specify a particular compression algorithm, but should allow for at least one of the ones I mentioned.)
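There may not be an off-the-shelf library for this, but the core mechanics are simple enough to sketch. Below is a hypothetical, plain-Java illustration (no Hadoop dependency; the class name and helper methods are my own, not from any library) of the framing involved: delimited protobuf files are a sequence of records, each prefixed with a varint length, which is the format `Message.writeDelimitedTo` / `parseDelimitedFrom` produce. Decompression is just a `GZIPInputStream` wrapper; in a real MapReduce job you would put this logic inside a custom `RecordReader` and parse each `byte[]` payload as a protobuf message.

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

// Sketch: round-trip length-delimited records through gzip.
// The varint length prefix matches protobuf's writeDelimitedTo framing,
// so in practice each byte[] payload would be a serialized message.
public class GzipDelimitedReader {

    // Encode a varint length prefix (low 7 bits per byte, MSB = continuation).
    static void writeVarint(OutputStream out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Decode a varint; returns -1 on clean EOF before any byte is read.
    static int readVarint(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) return -1;
        int result = b & 0x7F, shift = 7;
        while ((b & 0x80) != 0) {
            b = in.read();
            if (b == -1) throw new EOFException("truncated varint");
            result |= (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    // Gzip-compress a sequence of length-delimited records.
    static byte[] compressDelimited(byte[][] records) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (byte[] rec : records) {
                writeVarint(gz, rec.length);
                gz.write(rec);
            }
        }
        return buf.toByteArray();
    }

    // Read every length-delimited record from a gzip-compressed stream.
    static List<byte[]> readDelimited(InputStream compressed) throws IOException {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(compressed))) {
            int len;
            while ((len = readVarint(in)) != -1) {
                byte[] rec = new byte[len];
                in.readFully(rec);
                records.add(rec);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[][] recs = { "record-one".getBytes(), "record-two".getBytes() };
        byte[] gz = compressDelimited(recs);
        for (byte[] rec : readDelimited(new ByteArrayInputStream(gz))) {
            System.out.println(new String(rec));
        }
    }
}
```

Note that a whole-file gzip stream like this is not splittable, which (per the question) is acceptable here: each input file becomes one map task.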


1 Answer


You may want to look into the RAgzip patch for Hadoop, which enables multiple map tasks to process a single large gzipped file: RAgzip