I am loading a large number of files from Azure storage into Pig. Pig supports gzip by default, so files with a .gz extension load without any problem.

The problem is that older files are stored with a .zip extension (I have millions of those).

Is there a way to tell Pig to load these files and treat .zip as gzip?

1 Answer

I don't know whether other options are available, but you can try something like this:

  1. Write a bash script that converts the given zip file to a gz file.
  2. Load the gz file in Pig.

Here is a sample for a single file; you may need to adapt the script to your needs.

input.zip (contents of the archived file)
1,john
2,cena
3,rock
4,sam

test.sh
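#!/bin/bash
# Strip the .zip suffix to get the base name (e.g. input.zip -> input)
FILE_NAME="${1%.zip}"
# Stream the zipped data straight into gzip. tar czf would create a tar
# archive instead, and its headers would end up in the data Pig reads.
unzip -p "$1" | gzip > "$FILE_NAME.gz"
pig -x local -param PIG_INPUT_FILE="$FILE_NAME.gz" -f myscript.pig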

myscript.pig
A = LOAD '$PIG_INPUT_FILE' USING PigStorage(',');
DUMP A;

Output:

$ ./test.sh input.zip

(1,john)
(2,cena)
(3,rock)
(4,sam)

Another option is to write a UDF that uses the java.util.zip library to convert zip to gz and plug it in through a custom LoadFunc. I haven't tried this option, but you can give it a try if you want.
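
As a rough, untested-against-Pig sketch of the java.util.zip part of that idea, the conversion step could look something like the code below. It is not a LoadFunc and does not touch any Pig API; the class name ZipToGzip and the method convert are made up for illustration, and it assumes each zip archive holds a single data file (only the first non-directory entry is copied).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper (not part of Pig): re-compresses zipped data as gzip.
public class ZipToGzip {

    // Copies the first non-directory entry of the zip stream into a gzip stream.
    // Returns the name of the copied entry, or null if no file entry was found.
    // Both streams are closed when the method returns.
    public static String convert(InputStream zipIn, OutputStream gzOut) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(zipIn)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // skip directory entries, they carry no data
                }
                try (GZIPOutputStream gos = new GZIPOutputStream(gzOut)) {
                    byte[] buf = new byte[8192];
                    int n;
                    // read() returns -1 at the end of the current entry
                    while ((n = zis.read(buf)) != -1) {
                        gos.write(buf, 0, n);
                    }
                }
                return entry.getName();
            }
        }
        return null;
    }
}
```

To convert a file on disk you could call it as ZipToGzip.convert(new FileInputStream("input.zip"), new FileOutputStream("input.gz")) and then load input.gz in Pig as before. Wiring this into a real loader would still need the Pig LoadFunc plumbing, which is not shown here.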