
I'm working with an 11-node cluster running on AWS right now (1 master, 10 workers - c3.4xlarge) and I'm attempting to read in ~200GB of .csv files (only about 10 actual .csv files) from HDFS.

This process is going really slowly. I'm watching Spark at the command line and the progress bar looks like this:

[Stage 0:>                                                      (30 + 2) / 2044]

The completed-task count increases by 2 (30+2 to 32+2 to 34+2, etc.) maybe every 20 seconds. This is in serious need of improvement, or we'll be here for roughly 11 hours before the files are even done being read.

This is the code so far:

# AMAZON AWS EMR
from pyspark import SparkConf, SparkContext

def sparkconfig():
    conf = SparkConf()
    conf.setMaster("yarn-client")   # client mode so output comes back to the terminal
    conf.set("spark.default.parallelism", 340)
    conf.setAppName("my app")
    conf.set("spark.executor.memory", "20g")
    return conf


sc = SparkContext(conf=sparkconfig(),
                  pyFiles=['/home/hadoop/temp_files/redis.zip'])

path = 'hdfs:///tmp/files/'
all_tx = sc.textFile(path).coalesce(1024)
# ... more code for processing

Now clearly 1024 partitions may not be correct; that number just came from googling and trying different things. I'm really at a loss when it comes to tuning this job.
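For context, here is a minimal sketch of the alternative I've been experimenting with: letting the HDFS block size drive the initial partition count and only coalescing afterwards (the numbers are placeholders, not settings I'm confident in):

# Sketch only: textFile() already creates roughly one partition per HDFS
# block (~128MB), which is where the 2044 in the progress bar comes from.
all_tx = sc.textFile(path)
print(all_tx.getNumPartitions())

# If fewer, larger partitions are wanted for later stages, coalesce() merges
# them without a shuffle; repartition() would do a full shuffle instead.
fewer_tx = all_tx.coalesce(512)   # 512 is just a placeholder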

The worker nodes on AWS are c3.4xlarge instances (I have 10 in the cluster), each with 30GB of RAM and 16 vCPUs. The HDFS partition is made up of the local storage on each node in the cluster, which is 2x160GB SSDs, so I believe we're looking at (2 * 160GB * 10 nodes / 3x replication) = ~1TB of HDFS.

The .csv files themselves range from 5GB to 90GB in size.

To clarify in case it is relevant, the Hadoop cluster consists of the same nodes as the Spark cluster. I'm allocating 20GB of the 30GB per node to the Spark executor, leaving 10GB per node for the OS plus Hadoop/YARN, etc. The name node / Spark master node is an m3.xlarge, which has 4 vCPUs and 16GB of RAM.
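For completeness, this is roughly how I understand those resources would be spelled out explicitly to YARN; the instance count, core count, and overhead value below are my assumptions about which knobs exist, not settings I've validated:

# Sketch of explicit YARN executor sizing; values are illustrative assumptions.
conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("my app")
conf.set("spark.executor.instances", "10")   # one executor per worker node
conf.set("spark.executor.cores", "15")       # leave a core for OS/daemons
conf.set("spark.executor.memory", "20g")
# YARN containers need headroom above the executor heap; this value is a guess.
conf.set("spark.yarn.executor.memoryOverhead", "2048")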

Does anyone have suggestions on tuning options (or anything, really) that I might try to speed up this file-read process?

Suggestion: convert them to Parquet and split them carefully in the HDFS cluster, being careful about the number of partitions. – Alberto Bonsanto
Thank you! I'll look into the best method to convert the existing CSV files to Parquet and see how much that improves performance. – Brandon
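Regarding the Parquet suggestion above, here is a minimal sketch of the conversion, assuming the Spark 2.x built-in CSV reader is available (on Spark 1.x the spark-csv package provides the equivalent); the header/schema options and output path are assumptions about the data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

df = (spark.read
      .option("header", "true")       # adjust to the actual file layout
      .option("inferSchema", "true")  # or supply an explicit schema, which is faster
      .csv("hdfs:///tmp/files/"))

# repartition() controls how many Parquet files get written; 340 simply
# mirrors the spark.default.parallelism value used in the question.
df.repartition(340).write.parquet("hdfs:///tmp/files_parquet/")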

1 Answer


Shameless plug (I'm the author): try Sparklens, https://github.com/qubole/sparklens. Most of the time the real question is not whether the application is slow, but whether it will scale. And for most applications, the answer is: only up to a limit.
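One way to attach it from PySpark is to register its listener on the SparkConf; the package coordinates and listener class below are assumptions, so check the Sparklens README for the exact values that match your Spark/Scala version:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("my app with sparklens")
# Pull the Sparklens jar from spark-packages; the version/Scala suffix here
# is an assumption, check the README for current coordinates.
conf.set("spark.jars.packages", "qubole:sparklens:0.3.2-s_2.11")
# Register the listener that collects the per-stage metrics Sparklens reports on.
conf.set("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")

sc = SparkContext(conf=conf)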

The structure of a Spark application puts important constraints on its scalability. The number of tasks in a stage, the dependencies between stages, skew, and the amount of work done on the driver side are the main constraints.
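As a small illustration of the driver-side constraint, compare work that stays on the executors with work pulled back to the driver (sketch only, reusing the all_tx RDD from the question):

# Aggregations that run on the executors scale with the cluster.
line_count = all_tx.count()
byte_count = all_tx.map(lambda line: len(line)).sum()

# collect() funnels the whole dataset through the single driver process,
# which adding workers cannot speed up (and would likely OOM here).
# lines = all_tx.collect()   # avoid on a ~200GB dataset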