3
votes

I have successfully read a text file stored in S3 and written it back to S3 in ORC format using Spark DataFrames: inputDf.write().orc(outputPath);
What I am not able to do is convert to ORC format with Snappy compression. I have already tried setting the codec to snappy as an option while writing, but Spark still writes plain ORC. How can I write in ORC format with Snappy compression to S3 using Spark DataFrames?

The default (zlib) might be better than Snappy anyway: community.hortonworks.com/questions/4067/… - Mark Rajcok
@MarkRajcok Thanks. That means I can compress ORC output using .option only if I am on Spark 2.0. Is there any other way you can suggest to compress the output? I am on Amazon EMR with Spark 1.6. - abstractKarshit
I haven't found a way to write a dataframe out as ORC-snappy on Spark 1.x. - Mark Rajcok

1 Answer

3
votes

For anyone facing the same issue: in Spark 2.0 this works out of the box, because the default compression codec for ORC output is Snappy.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConvertToOrc {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("OrcConvert")
                .getOrCreate();
        String inputPath = args[0];
        String outputPath = args[1];

        // Read a delimited text file (Ctrl-A separated, single-quoted fields)
        Dataset<Row> inputDf = spark.read()
                .option("sep", "\001")
                .option("quote", "'")
                .csv(inputPath);

        // Write as ORC; in Spark 2.x the default codec is Snappy
        inputDf.write().format("orc").save(outputPath);
    }
}
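
If you want to set the codec explicitly rather than rely on the default (or switch to a different one), the Spark 2.x DataFrameWriter accepts a "compression" option for ORC. A minimal sketch, assuming the same inputDf and outputPath as above:

```java
// Explicitly request Snappy compression for ORC output (Spark 2.x)
inputDf.write()
        .format("orc")
        .option("compression", "snappy") // other accepted values include "zlib" and "none"
        .save(outputPath);
```

This fragment requires a running SparkSession, so it is shown as a write-step snippet rather than a standalone program.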