Hadoop handling compression transparently, but not splitting LZO

Question

It looks as if Hadoop handles compression transparently when using TextInputFormat (when was this introduced? I don't remember it in 0.20.203). Unfortunately, with LZO compression, Hadoop doesn't use the LZO index file to make the file splittable. However, if I set the input format to com.hadoop.mapreduce.LzoTextInputFormat, the file is split.

Is it possible to configure Hadoop to decompress LZO files and split them when using TextInputFormat?

Did you ever get Hadoop to use the LZO index file by default @schmmd? I'm still observing this behavior in CDH4.4.0 — Andrew

Kevin Hsu · Accepted Answer · 2014-01-31T19:54:42

I'm just running into a similar issue, and here's my understanding:

You want to use LzoTextInputFormat in your code. If you want to process a mix of LZO and non-LZO files, set lzo.text.input.format.ignore.nonlzo to false. In that case, LzoTextInputFormat handles all LZO files and falls back to TextInputFormat for the other files (it's smart enough to ignore the index files).
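
To make the dispatch rule above concrete, here is a minimal, dependency-free sketch of how I understand it. This is a toy model, not the hadoop-lzo implementation: the class LzoDispatchSketch, the Handling enum, and the use of java.util.Properties as a stand-in for a Hadoop Configuration are all hypothetical, and the `.lzo` / `.lzo.index` suffix checks and the default of `true` for the property are assumptions on my part.

```java
import java.util.Properties;

// Hypothetical model of the behaviour described in this answer.
// It does NOT call hadoop-lzo; it only mirrors the decision rule.
final class LzoDispatchSketch {
    // Property name as given in the answer.
    static final String IGNORE_NONLZO_KEY = "lzo.text.input.format.ignore.nonlzo";

    enum Handling { LZO_INPUT, PLAIN_TEXT, SKIPPED }

    // Decide how a single input file would be treated.
    // Assumption: the property defaults to true (non-LZO files are skipped).
    static Handling classify(String fileName, Properties conf) {
        boolean ignoreNonLzo =
            Boolean.parseBoolean(conf.getProperty(IGNORE_NONLZO_KEY, "true"));
        if (fileName.endsWith(".lzo.index")) {
            return Handling.SKIPPED;      // index files are never read as data
        }
        if (fileName.endsWith(".lzo")) {
            return Handling.LZO_INPUT;    // handled by the LZO input format
        }
        // Non-LZO file: skipped or read as plain text, depending on the flag.
        return ignoreNonLzo ? Handling.SKIPPED : Handling.PLAIN_TEXT;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(IGNORE_NONLZO_KEY, "false");
        for (String name : new String[] {"a.lzo", "a.lzo.index", "b.txt"}) {
            System.out.println(name + " -> " + classify(name, conf));
        }
    }
}
```

In a real job you would presumably set the same property on the job's Configuration (for example with setBoolean) and pass LzoTextInputFormat.class to Job.setInputFormatClass, but check that against the hadoop-lzo version you are running.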

This feature may not have been around when this question was first asked, so you might already be aware of this solution.

Please see (there's a comment about ignore.nonlzo): https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
