0
votes

I am using Spark and I have different kinds of compressed files on HDFS (zip, gzip, 7zip, tar, bz2, tar.gz, etc.). Could anyone please tell me the best way to decompress them? For some compression formats I could use CompressionCodec, but it does not support every format. For zip files I did some searching and found that ZipFileInputFormat could be used, but I could not find any jar for it.

1
You can write your own input format and record reader in Java and import them into Scala: gist.github.com/jteso/1868049 – OneCricketeer
zip, 7zip, and tar are archive formats, not necessarily "compressed" in the way BZip2 and Gzip are (gz and gzip are the same thing; tar.gz is a tar archive that has been gzip-compressed). Anyway, BZip2 is the best option within HDFS: comphadoop.weebly.com/index.html – OneCricketeer
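
If you only need the contents of zip archives and don't want to write a full custom InputFormat, a common alternative is to read each archive as a binary blob with sc.binaryFiles and unpack it with java.util.zip.ZipInputStream. This is only a minimal sketch of that approach, not the code from the gist; the HDFS path is a made-up example and it assumes each zip entry contains plain text:

import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream

// Each zip is read as a whole (zip archives are not splittable),
// then unpacked inside a flatMap on the executors.
val zipLines = spark.sparkContext
  .binaryFiles("hdfs:///data/*.zip")              // hypothetical path
  .flatMap { case (_, portableStream) =>
    val zis = new ZipInputStream(portableStream.open())
    // Walk every entry in the archive and emit its text lines.
    Iterator.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val reader = new BufferedReader(new InputStreamReader(zis))
        Iterator.continually(reader.readLine()).takeWhile(_ != null)
      }
  }

zipLines.take(10).foreach(println)

Because each archive is opened as a single stream, one task processes one zip file, so this works best for many small-to-medium archives rather than one very large one.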

1 Answer

0
votes

For some compressed formats (I know this is true for tar.gz and zip; I haven't tested the others), you can use the DataFrame API directly and it will take care of the decompression for you:

val df = spark.read.json("compressed-json.tar.gz")
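
The same call also works for the single-file codecs that Hadoop ships with, since Spark picks the codec from the file extension. A small example for illustration (the file names are made up, not from the question):

val gzDf  = spark.read.json("hdfs:///data/events.json.gz")   // gzip: read transparently, not splittable
val bz2Df = spark.read.json("hdfs:///data/events.json.bz2")  // bzip2: read transparently and splittable
gzDf.printSchema()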