I am using Spark SQL to read two different datasets stored in ORC format in S3. The datasets are almost the same size, but the read performance differs hugely.
Dataset 1: 212,000,000 records with 50 columns each, totaling ~15 GB in ORC format in an S3 bucket.
Dataset 2: 29,000,000 records with 150 columns each, totaling ~15 GB in ORC format in the same S3 bucket.
Dataset 1 takes 2 minutes to read with Spark SQL, while the same read/count job on the same infrastructure takes 12 minutes for Dataset 2.
Could the length of each row cause this big a difference? Can anyone help me understand the reason behind the huge performance difference when reading these datasets?