
My use case is to merge two tables: one table contains 30 million records with 200 columns, and the other contains 1 million records with 200 columns. I am using a broadcast join for the small table. I am loading both tables as DataFrames from Hive managed tables on HDFS.
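Roughly, the merge looks like the sketch below (Scala, Spark 2.0 DataFrame API). The database/table names and the join key are placeholders, not the real identifiers.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Placeholder names; the real tables are Hive managed tables on HDFS.
val spark = SparkSession.builder()
  .appName("merge-large-small")
  .enableHiveSupport()
  .getOrCreate()

val largeDf = spark.table("db.large_table")   // ~30 million rows, 200 columns
val smallDf = spark.table("db.small_table")   // ~1 million rows, 200 columns

// Broadcast the small table so the large table is not shuffled for the join.
val merged = largeDf.join(broadcast(smallDf), Seq("join_key"))

// Write the result back to Hive as ORC.
merged.write.format("orc").saveAsTable("db.merged_table")
```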

I need to know what values to set for driver memory, executor memory, and the other related parameters for this use case.

I have the following hardware configuration for my YARN cluster:

Spark version: 2.0.0

HDP version: 2.5.3.0-37

1) YARN clients: 20

2) Max virtual cores allocated for a container (yarn.scheduler.maximum-allocation-vcores): 19

3) Max memory allocated for a YARN container: 216 GB

4) Cluster memory available: 3.1 TB

I can provide any other information you need about this cluster.

I have to decrease the time to complete this process.

I have been using some configurations, but I think they are wrong. The job took 4.5 minutes to complete, and I think Spark is capable of doing it faster.

anshul_cached: Are you caching the table into RAM?
ShuBham ShaRma: No @anshul_cached, I am just using spark.table("table_name") to load the table as a DataFrame and then merging the frames. After merging, I write the table back to Hive, stored as ORC.
anshul_cached: Cache your dataframes into RAM, perform a count action, and then apply the rest.
ShuBham ShaRma: I tried to cache both dataframes, but then it takes more time while caching the dataframes and then processing them. @anshul_cached
anshul_cached: Caching is lazy. Are you performing a count action on the dataframes after caching?

1 Answer


There are mainly two things to look at when you want to speed up your Spark application.

Caching/persistence:

This is not a direct way to speed up the processing. It is useful when you have multiple actions (reduce, join, etc.) and you want to avoid re-computing the RDDs in case of failures, and hence reduce the application's run duration.
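For illustration, a minimal sketch (Scala; it assumes an existing SparkSession named spark, and the table and column names are placeholders):

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: `spark` is an existing SparkSession; the table name is a placeholder.
val df = spark.table("db.some_table")

// cache()/persist() are lazy; run an action such as count() to actually
// materialize the data in memory (as suggested in the comments above).
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()

// Later actions reuse the cached data instead of recomputing it from Hive.
val nonNull = df.filter(df("some_column").isNotNull).count()
```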

Increasing the parallelism:

This is the actual solution to speed up your Spark application. It can be achieved by increasing the number of partitions. Depending on the use case, you might have to increase the partitions (a sketch follows the list below):

  • Whenever you create your dataframes/RDDs: This is the better way to increase the partitions, as you don't have to trigger a costly shuffle operation.

  • By calling repartition: This will trigger a shuffle operation.
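As a rough sketch of ending up with more partitions (Scala; the table name is a placeholder, and 2000 is only an illustrative count to size against your cluster, not a tuned value):

```scala
// Option 1: raise the partition count used for shuffle outputs (joins, aggregations).
// The Spark 2.0 default is 200, which is often too low for tables of this size.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// Option 2: explicitly repartition an existing DataFrame.
// Note: this triggers a shuffle, as mentioned above.
val largeDf = spark.table("db.large_table")
val repartitioned = largeDf.repartition(2000)
```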

Note: Once you increase the number of partitions, then increase the number of executors (maybe a very large number of small containers with a few vcores and a few GB of memory each).

Increasing the parallelism inside each executor:

By adding more cores to each executor, you can increase the parallelism at the partition level. This will also speed up the processing.
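A hedged sketch of what the resource settings might look like when the session is built (Scala; the numbers are illustrative placeholders to be sized against the 216 GB / 19 vcore container limits and ~3.1 TB of cluster memory from the question, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only; tune executor count, cores and memory to the cluster.
// Executor settings applied here take effect because executors start later;
// driver memory usually has to be given at launch time (e.g. spark-submit
// --driver-memory), since the driver JVM is already running at this point.
val spark = SparkSession.builder()
  .appName("merge-large-small")
  .enableHiveSupport()
  .config("spark.executor.instances", "40") // many smaller executors once partitions are increased
  .config("spark.executor.cores", "4")      // parallel tasks per executor
  .config("spark.executor.memory", "8g")
  .getOrCreate()
```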

To get a better understanding of the configurations, please refer to this post.