I am using Spark and have various kinds of compressed files on HDFS (zip, gzip, 7zip, tar, bz2, tar.gz, etc.). Could anyone tell me the best way to decompress them? For some formats I can use CompressionCodec, but it does not support every format. For zip files, I searched and found that ZipFileInputFormat could be used, but I could not find any jar that provides it.
You can write your own input format and record reader in Java and import them into Scala: gist.github.com/jteso/1868049
– OneCricketeer
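To illustrate what the core of such a record reader might look like, here is a minimal sketch that walks the entries of a zip archive with the JDK's own `java.util.zip.ZipInputStream`. The class name `ZipEntryReader` and the helpers `readEntries` and `zipOf` are hypothetical names made up for this example, and it reads every entry fully into memory, which is only reasonable for small archives. It is not the code from the linked gist.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
class ZipEntryReader {

    // Reads every file entry of a zip archive from the stream into memory,
    // keyed by entry name. Directory entries are skipped.
    static Map<String, byte[]> readEntries(InputStream in) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    // Builds a one-entry zip archive in memory so the demo needs no files.
    static byte[] zipOf(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = zipOf("a.txt", "hello");
        Map<String, byte[]> entries = readEntries(new ByteArrayInputStream(archive));
        entries.forEach((name, data) ->
            System.out.println(name + ": " + new String(data, StandardCharsets.UTF_8)));
    }
}
```

In Spark, one way to feed this (an assumption about your setup, not something tested here) would be to load the archives with `binaryFiles` and pass each `PortableDataStream.open()` result to `readEntries`, instead of writing a full input format.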
zip, 7zip, and tar are archive formats, not necessarily "compressed" in the way BZip2 and Gzip are (gz and gzip are the same; tar.gz is a tar archive that has been compressed with gzip). Anyway, BZip2 is the best option within HDFS, since it is splittable: comphadoop.weebly.com/index.html
– OneCricketeer