
I wrote a UDF to load a file into Pig. It works well for loading text files, but now I also need to be able to read .gz files. I know I can unzip the file and then process it, but I want to read the .gz file directly without unzipping it.

My UDF extends LoadFunc, and my custom input format, MyInputFile, extends TextInputFormat. I also implemented MyRecordReader. I'm wondering if extending TextInputFormat is the problem? I tried FileInputFormat and still cannot read the file. Has anyone written a UDF that reads data from a .gz file before?
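To make the setup concrete, here is a trimmed-down skeleton of my loader (simplified for this post: names are shortened and the stock TextInputFormat is substituted where my MyInputFile/MyRecordReader pair normally goes):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class MyLoader extends LoadFunc {
        private RecordReader<?, ?> reader;

        @Override
        public void setLocation(String location, Job job) throws IOException {
            FileInputFormat.setInputPaths(job, location);
        }

        @Override
        public InputFormat getInputFormat() throws IOException {
            // In my real code this returns MyInputFile (extends TextInputFormat),
            // which in turn creates MyRecordReader.
            return new TextInputFormat();
        }

        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
            this.reader = reader;
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;  // end of input
                }
                // TextInputFormat yields (LongWritable offset, Text line) pairs
                String line = ((Text) reader.getCurrentValue()).toString();
                return TupleFactory.getInstance().newTuple(line);
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }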

TextInputFormat can handle gzip files. Have a look at its RecordReader's (LineRecordReader) initialize() method where the proper CompressionCodec is initialized. Also note that gzip files aren't splittable. - Lorand Bendig
Thanks for pointing this out. If it is not splittable, then I think I will consider unzipping it first. I'd much appreciate it if you could point out some best practices for unzipping the file first and then loading it into Pig. What is the best way to do this? Thanks. - Simon Guo
Without knowing the data size, the easiest way would be to store your data uncompressed on HDFS. You may also repack it in a splittable format (LZO). For extracting a gzip file from local disk to HDFS, see: bigdatanoob.blogspot.hu/2011/07/… . If it's already on HDFS: hadoop fs -cat /data/data.gz | gzip -d | hadoop fs -put - /data/data.txt - Lorand Bendig
How about from S3? Is it the same as when the data is already on HDFS? Also, could you post your comment as an answer, so I can accept it? :) - Simon Guo

1 Answer


TextInputFormat handles gzip files as well. Have a look at its RecordReader's (LineRecordReader) initialize() method, where the proper CompressionCodec is initialized. Also note that gzip files aren't splittable (even if they are located on S3), so you may need to either use a splittable format (e.g. LZO) or store the data uncompressed to get the desired level of parallelism.
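To illustrate, here is a rough standalone sketch of the same codec lookup that LineRecordReader.initialize() performs (the GzipProbe class name is made up for this example; this is not the actual Hadoop source):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class GzipProbe {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path file = new Path(args[0]);  // e.g. /data/data.gz
            FileSystem fs = file.getFileSystem(conf);

            // Pick a codec from the file extension (.gz -> GzipCodec),
            // or null for plain text -- the same lookup LineRecordReader does.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(file);

            InputStream in = (codec == null)
                    ? fs.open(file)
                    : codec.createInputStream(fs.open(file));
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                System.out.println(reader.readLine());  // first decompressed line
            }
        }
    }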

If your gzipped data is stored locally, you can uncompress it and copy it to HDFS in one step, as described here. Or, if it's already on HDFS,

    hadoop fs -cat /data/data.gz | gzip -d | hadoop fs -put - /data/data.txt

would be more convenient.
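If you prefer to do that decompression step from Java rather than the shell pipeline, a rough sketch using Hadoop's FileSystem and codec APIs could look like the following (the GunzipToHdfs class name is made up; treat it as an untested illustration):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class GunzipToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path src = new Path(args[0]);   // e.g. /data/data.gz
            Path dst = new Path(args[1]);   // e.g. /data/data.txt

            FileSystem fs = src.getFileSystem(conf);
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(src);
            if (codec == null) {
                throw new IOException("No compression codec found for " + src);
            }

            // Read through the codec so the stream is decompressed on the fly,
            // then write the plain text back out -- the same effect as the
            // hadoop fs -cat ... | gzip -d | hadoop fs -put - ... pipeline.
            try (InputStream in = codec.createInputStream(fs.open(src));
                 OutputStream out = fs.create(dst)) {
                IOUtils.copyBytes(in, out, 4096, false);
            }
        }
    }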