4 votes

I want to read gzip-compressed files into an RDD[String] using an equivalent of sc.textFile("path/to/file.Z").

Except that my file extension is not .gz but .Z, so the files are not recognised as gzipped.

I cannot rename them, as that would break production code, and I do not want to copy them because they are massive and numerous. I guess I could use some kind of symlink, but first I want to see if there is a way to do it with Scala/Spark (I am on my local Windows machine for now).

How can I read these files efficiently?

1
Also related: there is an issue on the Spark tracker about adding a way to explicitly specify a compression codec when reading files, so Spark doesn't infer it from the file extension. – Nick Chammas

1 Answer

6 votes

There is a workaround for this problem described here: http://arjon.es/2015/10/02/reading-compressed-data-with-spark-using-unknown-file-extensions/

The relevant section:

...extend GzipCodec and override the getDefaultExtension method.

package smx.ananke.spark.util.codecs

import org.apache.hadoop.io.compress.GzipCodec

// Gzip codec that matches the ".gz.tmp" extension instead of ".gz"
class TmpGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.tmp" // You should change it to ".Z"
}

Now register this codec by setting spark.hadoop.io.compression.codecs on the SparkConf (Hadoop picks a codec for each file by matching its extension against getDefaultExtension):

val conf = new SparkConf()

// Custom codec that processes the .gz.tmp extension as the common gzip format
conf.set("spark.hadoop.io.compression.codecs", "smx.ananke.spark.util.codecs.TmpGzipCodec")

val sc = new SparkContext(conf)

val data = sc.textFile("s3n://my-data-bucket/2015/09/21/13/*")
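
For the .Z files in the question specifically, the same trick applies; here is a minimal sketch (the package and class names below are placeholders, not from the original post):

package myapp.codecs // hypothetical package name

import org.apache.hadoop.io.compress.GzipCodec

// Reuses GzipCodec's decompression, but matches files ending in ".Z"
class ZGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".Z"
}

Register it the same way and a plain textFile call will gunzip the files transparently, with no renaming or copying:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
conf.set("spark.hadoop.io.compression.codecs", "myapp.codecs.ZGzipCodec")
val sc = new SparkContext(conf)

// Each .Z file is decompressed on the fly as gzip
val data = sc.textFile("C:/path/to/files/*.Z")

Two caveats: the .Z extension conventionally marks Unix compress (LZW) output, so this only works because the files here are actual gzip streams; and depending on your Hadoop version, setting io.compression.codecs may replace the default codec list, in which case you should include the stock codecs alongside the custom one.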