1
vote

When I submit a Spark job to the YARN cluster, the Spark UI shows 4 stages for the job, but memory used is very low on all nodes and it says 0 out of 4 GB used. I guess that might be because I left the partitioning at the default.

File sizes in S3 range between 1 MB and 100 MB. There are around 2700 files totalling 26 GB, and exactly 2700 tasks were running in stage 2.

Is it worth repartitioning to around 640 partitions (roughly as sketched below), and would that improve performance? Or
does it not matter if the partitioning is more granular than actually required? Or
do my submit parameters need to be addressed?
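
For context, this is roughly the repartitioning I have in mind, run from spark-shell where spark is the SparkSession (the S3 paths and the Parquet format are placeholders, not my real job):

    // Read the ~2700 small S3 files and shuffle them into ~640 partitions
    val df = spark.read.parquet("s3a://some-bucket/input/")   // placeholder path and format
    println(df.rdd.getNumPartitions)                          // partition count Spark chose by default
    val repartitioned = df.repartition(640)                   // the partition count I am asking about
    repartitioned.write.parquet("s3a://some-bucket/output/")  // placeholder output path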

Spark UI cluster details:

Cluster with 10 nodes
Overall memory 500 GB
Overall vCores 64

--executor-memory 16g
--num-executors 16
--executor-cores 1

It actually runs on 17 cores out of 64. I don't want to increase the number of cores since others might use the cluster.

1
By the way, if you are using YARN you can set the scheduler to FAIR. - Alberto Bonsanto
The memory usage reported by the Spark UI only reflects persisted RDDs. If your job never persists RDDs, this value will remain 0 and does not reflect the memory actually used by your job. Repartitioning should be used for large files that need to be broken into smaller chunks for efficient use on your cluster (large compressed files, for example); large uncompressed files will be split automatically. - kberg

1 Answer

1
vote

You partition, and repartition, for the following reasons:

  • To make sure we have enough work to distribute to the distinct cores in our cluster (nodes * cores_per_node). Obviously we need to tune the number of executors, cores per executor, and memory per executor to make that happen as intended.
  • To make sure we distribute work evenly: the smaller the partitions, the smaller the chance that one core has much more work to do than all the other cores. Skewed distribution can have a huge effect on total elapsed time if the partitions are too big.
  • To keep partitions at manageable sizes: not too big and not too small, so we don't overtax the GC. Bigger partitions can also become a problem when an operation has non-linear complexity.
  • Too-small partitions will create too much processing overhead.

As you might have noticed, there is a Goldilocks zone. Testing will help you determine the ideal partition size.

Note that it is OK to have many more partitions than cores. Queuing partitions to be assigned a task is something I design for.
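
As a minimal sketch of how to check and tune this, assuming a SparkSession named spark (as in spark-shell), the 16 executors x 1 core from your question, and a placeholder input path; the multiplier is only a common rule of thumb, not a prescription for your data:

    // Total cores available to the job: num-executors * executor-cores
    val totalCores = 16 * 1
    // A frequently used starting point is a few partitions per core
    val targetPartitions = totalCores * 3

    val df = spark.read.parquet("s3a://some-bucket/input/")  // placeholder input
    println(df.rdd.getNumPartitions)                         // what Spark picked by default
    val tuned = df.repartition(targetPartitions)             // shuffle into the chosen partition count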

Also make sure the rest of your Spark job configuration is right; a sketch follows this list:

  • Make sure you do not have too many executors. One or very few executors per node is more than enough. Fewer executors have less overhead, because tasks within an executor share memory space and are handled by threads instead of processes. Starting a process carries a lot of overhead, while threads are fairly lightweight.
  • Tasks need to talk to each other. If they are in the same executor, they can do that in memory. If they are in different executors (processes), it happens over a socket (overhead). If it crosses nodes, it happens over a regular network connection (more overhead).
  • Assign enough memory to your executors. When using YARN as the scheduler, by default it fits executors onto nodes by their memory, not by the number of CPUs you declare.
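
For your 10-node, 64-vCore cluster, one possible shape that keeps roughly the 16 cores your job already uses would be something like the following; the exact values are illustrative and need testing, and you must leave room for the driver, YARN memory overhead, and other users:

    --num-executors 8       (at most one executor per node)
    --executor-cores 2      (16 task threads in total, about what the job uses today)
    --executor-memory 16g   (per-executor heap; YARN adds memory overhead on top of this)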

I do not know your exact situation (you made the node names invisible), but if you only have a single node with 15 cores, then 16 executors do not make sense. Instead, set it up with one executor and 16 cores per executor.
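
In that single-node case the submit parameters would look something like this (the memory value is kept from your question purely for illustration):

    --num-executors 1
    --executor-cores 16
    --executor-memory 16g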