
I would like to use Hadoop Map/Reduce to process delimited Protocol Buffer files that are compressed using something other than LZO, e.g. xz or gzip. Twitter's elephant-bird library seems to mainly support reading protobuf files that are LZO compressed and thus doesn't seem to meet my needs. Is there an existing library or a standard approach to doing this?

(NOTE: As you can see by my choice of compression algorithms, it's not necessary for the solution to make the protobuf files splittable. Your answer doesn't even need to specify a particular compression algorithm, but should allow for at least one of the ones I mentioned.)
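There may not be an off-the-shelf library for this, but the core mechanics are simple enough to sketch. Below is a hypothetical, plain-Java illustration (no Hadoop dependency; the class name and helper methods are my own, not from any library) of the framing involved: delimited protobuf files are a sequence of records, each prefixed with a varint length, which is the format `Message.writeDelimitedTo` / `parseDelimitedFrom` produce. Decompression is just a `GZIPInputStream` wrapper; in a real MapReduce job you would put this logic inside a custom `RecordReader` and parse each `byte[]` payload as a protobuf message.

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

// Sketch: round-trip length-delimited records through gzip.
// The varint length prefix matches protobuf's writeDelimitedTo framing,
// so in practice each byte[] payload would be a serialized message.
public class GzipDelimitedReader {

    // Encode a varint length prefix (low 7 bits per byte, MSB = continuation).
    static void writeVarint(OutputStream out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Decode a varint; returns -1 on clean EOF before any byte is read.
    static int readVarint(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) return -1;
        int result = b & 0x7F, shift = 7;
        while ((b & 0x80) != 0) {
            b = in.read();
            if (b == -1) throw new EOFException("truncated varint");
            result |= (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    // Gzip-compress a sequence of length-delimited records.
    static byte[] compressDelimited(byte[][] records) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (byte[] rec : records) {
                writeVarint(gz, rec.length);
                gz.write(rec);
            }
        }
        return buf.toByteArray();
    }

    // Read every length-delimited record from a gzip-compressed stream.
    static List<byte[]> readDelimited(InputStream compressed) throws IOException {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(compressed))) {
            int len;
            while ((len = readVarint(in)) != -1) {
                byte[] rec = new byte[len];
                in.readFully(rec);
                records.add(rec);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[][] recs = { "record-one".getBytes(), "record-two".getBytes() };
        byte[] gz = compressDelimited(recs);
        for (byte[] rec : readDelimited(new ByteArrayInputStream(gz))) {
            System.out.println(new String(rec));
        }
    }
}
```

Note that a whole-file gzip stream like this is not splittable, which (per the question) is acceptable here: each input file becomes one map task.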


1 Answer


You may want to look into the RAgzip patch for Hadoop, which enables multiple map tasks to process a single large gzipped file: RAgzip