I'm working with an 11-node cluster running on AWS right now (1 master, 10 workers - c3.4xlarge) and I'm attempting to read in ~200GB of .csv files (only about 10 actual .csv files) from HDFS.
This process is going really slowly. I'm watching spark at the command line and it looks like this..
[Stage 0:> (30 + 2) / 2044]
With an increase of +2 units (meaning 30+2 going to 32+2 to 34+2, etc..) of progress maybe every 20 seconds. So this is in serious need of improvement, or else we'll be here for about 11 hours before the files are even done reading.
This is the code to this point..
# AMAZON AWS EMR
def sparkconfig():
conf = SparkConf()
conf.setMaster("yarn-client) #client gets output to terminals
conf.set("spark.default.parallelism",340)
conf.setAppName("my app")
conf.set("spark.executor.memory", "20g")
return conf
sc = SparkContext(conf=sparkconfig(),
pyFiles=['/home/hadoop/temp_files/redis.zip'])
path = 'hdfs:///tmp/files/'
all_tx = sc.textFile(my_path).coalesce(1024)
... more code for processing
Now clearly 1024 for partitions may not be correct, that was just from googling and trying different things. I'm really at a loss when it comes to tuning this job.
The worker nodes @ AWS are c3.4xlarge instances (I have 10 in the cluster) consist of 30GB of RAM with 16 vCPU's each. The HDFS partition is made up of the local storage for each node in the cluster, which is 2x160GB SSD's, so I believe we're looking at (2*160GB * 10nodes / 3 replication) = ~1TB of HDFS.
The .csv files themselves range from 5GB to 90GB in size.
To clarify in case it is relevant, the Hadoop cluster is the same as the spark cluster as far as the nodes go. I'm allocating 20GB of the 30GB per node to the spark executor, leaving 10GB per node for OS + Hadoop/YARN, etc.. The name-node / spark parent node is a m3.xlarge, which has 4 vcpu's and 16GB of RAM.
Does anyone have suggestions on tuning options (or anything, really) that I might try to speed up this file-read process?
parquet
and split them carefully in theHDFS
cluster being careful of the number of partitions – Alberto Bonsanto