hadoop: automatic splittable output from lzo compression

Question

I am setting up the LZO codec as the compression tool for my Hadoop jobs. I know that LZO has the desirable feature of producing splittable files, but I have not found a way to get LZO to create splittable files automatically. The blogs I have read so far all mention running an indexer outside the job and feeding the resulting LZO file as input to the MapReduce job.

I am running some Hadoop benchmarks and do not want to change the benchmark code; I just want to enable LZO compression in Hadoop to see its effect on the benchmark. I am planning to use LZO as the codec for compressing map output, but if the output is not splittable, the next phase will have to pull the whole compressed output onto the nodes before it can work on it.

Is there any Hadoop configuration option that instructs LZO to make the output files splittable, so that it is done transparently?

It requires another map-reduce job to build an LZO-index. It's why we use Snappy :) — Ivan Klass
Thanks! The lzo-dev folks should consider an option to create auto-indexed compressed output. — nom-mon-ir
@Klass Ivan, are you sure Snappy creates splittable output? I went to check further details and it is similar to lzo in that the output is not randomly seekable. — nom-mon-ir
No, it doesn't. Sorry for the confusion :\ Here is a useful article comparing LZO to Snappy: blog.cloudera.com/blog/2011/09/snappy-and-hadoop — Ivan Klass

Carlo Medas · Accepted Answer · 2016-09-16T07:16:38

BZIP2 is splittable in Hadoop. It provides a very good compression ratio, but it is not optimal in terms of CPU time and performance, because compression is very CPU-intensive.

LZO is splittable in Hadoop: with hadoop-lzo you get splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel. The library provides the means to generate these indexes either locally or in a distributed manner.
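To illustrate why the index is needed, here is a minimal Python sketch of the idea. It is a simplified model and not hadoop-lzo's actual code: the function names, the block offsets, and the sizes are all made up for illustration, and the file header is ignored. The point is that a reader can only start decompressing at the beginning of a compressed block, so the raw byte ranges have to be moved onto block starts, and those block starts are what an index records.

```python
from bisect import bisect_left


def align_splits(block_offsets, file_length, split_size):
    """Return (start, end) byte ranges whose edges fall on block starts.

    block_offsets: sorted byte offsets where each compressed block begins
                   (hypothetical values standing in for an index file).
    """
    def next_block(pos):
        # First block start at or after pos; end of file if none remains.
        i = bisect_left(block_offsets, pos)
        return block_offsets[i] if i < len(block_offsets) else file_length

    splits = []
    for raw_start in range(0, file_length, split_size):
        raw_end = min(raw_start + split_size, file_length)
        start, end = next_block(raw_start), next_block(raw_end)
        if start < end:  # skip raw ranges that contain no block start
            splits.append((start, end))
    return splits


def splits_without_index(file_length):
    # Without block offsets there is no safe place to start mid-file,
    # so the whole file is handed to a single reader.
    return [(0, file_length)]


if __name__ == "__main__":
    offsets = [0, 100, 250, 400, 520]  # assumed block starts, in bytes
    print(align_splits(offsets, file_length=600, split_size=200))
    print(splits_without_index(600))
```

With the index, the 600-byte example file is cut into several ranges that can be processed in parallel; without it, there is one range covering the whole file, which means one mapper.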

LZ4 is splittable in Hadoop: with hadoop-4mc you get splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any level of speed/compression ratio: from a fast mode reaching 500 MB/s compression speed up to high/ultra modes that provide an increased compression ratio, almost comparable to that of GZIP.

ZSTD offers even better compression and is also supported by hadoop-4mc.
