I wrote a UDF to load files into Pig. It works well for text files, but now I also need to read .gz files. I know I can unzip the file first and then process it, but I want to read the .gz file directly, without unzipping it.
My UDF extends LoadFunc, and my custom input format, MyInputFile, extends TextInputFormat. I also implemented MyRecordReader. Could extending TextInputFormat be the problem? I tried FileInputFormat, but I still cannot read the file. Has anyone written a UDF that reads data from a .gz file before?
TextInputFormat can handle gzip files. Have a look at its RecordReader's (LineRecordReader) initialize() method where the proper CompressionCodec is initialized. Also note that gzip files aren't splittable. - Lorand Bendig
LZO). How to extract gzip file; local disk->HDFS, see: bigdatanoob.blogspot.hu/2011/07/… . If already on hdfs: hadoop fs -cat /data/data.gz | gzip -d | hadoop fs -put - /data/data.txt - Lorand Bendig
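To illustrate the idea in the first comment, here is a minimal sketch of extension-based decompression using only the Java standard library. It is not Hadoop's code: the class GzipAwareLineReader and its methods are hypothetical names, and java.util.zip.GZIPInputStream stands in for the CompressionCodec that LineRecordReader would look up. A custom record reader that opens the raw stream itself, without a step like this, would see compressed bytes, which may be why the file cannot be read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: picks a decompressing stream from the file extension,
// loosely mirroring what a codec lookup by file name does.
public class GzipAwareLineReader {

    // Opens the file, wrapping it in GZIPInputStream when the name ends in ".gz".
    static BufferedReader open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        try {
            boolean gzipped = file.getFileName().toString().endsWith(".gz");
            InputStream in = gzipped ? new GZIPInputStream(raw) : raw;
            return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        } catch (IOException e) {
            raw.close(); // do not leak the handle if the gzip header is invalid
            throw e;
        }
    }

    // Reads every line; the caller never needs to know whether the file was compressed.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = open(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Because the whole gzip stream has to be decoded from the start, a reader like this cannot begin in the middle of the file, which is the same reason gzip input is not splittable.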