
I'm using Spark 1.6.1 and I'm still quite new to the Spark world. I'm experimenting with saving files in ORC format.

I'm trying to parse a relatively large text file (8 GB) into ORC. The file is quite wide, i.e. 200+ columns.

The column types are basic: Int, String, Date. I parse all the lines, then call persist() and save to the file.

Here is the basic code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType}

val schema = StructType(
  myTableColumns.map(c =>
    StructField(
      // field descriptions -- ~200 fields
    )
  )
)

val rowRDD = rddProcessedLines.map(line => {
  Row.fromSeq(line)
})

val fileSchemaRDD = hiveContext.createDataFrame(rowRDD, schema)

fileSchemaRDD.registerTempTable("output_table_name")
fileSchemaRDD.write.orc("output_folder")

The problem is that performance is quite poor. It is worse than any import of the same text file into a relational database.

I tried switching between the Snappy and LZF compressors, without much gain. I also played with the memory size and number of cores per node, with no improvement. Then I started changing the buffer size for compression, etc. I see that performance drops dramatically as the number of columns grows. Could somebody tell me where to look? Can somebody point me to useful material on optimizing ORC file saves?
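For reference, this is roughly what I have been tweaking (a sketch only; "spark.io.compression.codec" is the Spark-level codec I switched between snappy and lzf, and the ORC option keys are assumptions -- I have not confirmed that the 1.6 ORC data source passes them through to the writer):

import org.apache.spark.SparkConf

// Spark-level compression codec (affects shuffle and persisted blocks, not the ORC file itself)
val sparkConf = new SparkConf()
  .set("spark.io.compression.codec", "snappy") // or "lzf"

// ORC-side codec and compression buffer size (assumed option keys)
fileSchemaRDD.write
  .option("orc.compress", "SNAPPY")      // or "ZLIB"
  .option("orc.compress.size", "262144") // buffer size in bytes
  .orc("output_folder")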


1 Answer


The slow performance is due to the size of the file you're trying to load. To leverage Spark's distributed computing, make sure the input is split into multiple smaller files so the transformations run in parallel. Try splitting your 8 GB file into files of about 64 MB each. Also, from your code, you don't need to register the DataFrame as a temporary table before saving, since you aren't using it for any other transformations later.
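Here is a minimal sketch of both suggestions, assuming the same rowRDD and schema as in the question (the partition count of 128 is only an illustration, roughly 8 GB / 64 MB):

val fileSchemaRDD = hiveContext.createDataFrame(rowRDD, schema)

// Repartition so the write runs in parallel across many smaller tasks;
// no temporary table registration is needed before the save.
fileSchemaRDD
  .repartition(128)
  .write
  .orc("output_folder")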