0 votes

I am running a test job that takes 5 GB of zipped data and dumps it into MongoDB. I have 1 master and 3 slave nodes, each with 16 CPUs and 30 GB of RAM. After the job is submitted, Spark appears to use only 2 slave nodes and assigns 32 cores to the job, even though I am using dynamic allocation. This is the only job running on the cluster, so I expected around 47 cores (1 left for the YARN application master) to be used across all 3 nodes. I am using AWS EMR with YARN as the cluster manager.

Is there a particular reason why only 2 nodes take part in the job and only 32 cores are allocated under dynamic allocation?
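For reference, this is a minimal sketch of the kind of dynamic-allocation settings in play here (the application name and values are illustrative, not my exact configuration):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("zip-to-mongo")                            # hypothetical name
         .config("spark.dynamicAllocation.enabled", "true")  # let YARN scale executors up/down
         .config("spark.shuffle.service.enabled", "true")    # required for dynamic allocation on YARN
         .config("spark.executor.cores", "4")                # cores per executor, illustrative value
         .getOrCreate())
```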



1 Answer

0 votes

Zip files are not splittable. Unless you unpack the archive yourself, Spark can only load it on a single machine, so one task reads the whole file.
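To illustrate, a minimal PySpark sketch (the bucket and file name are hypothetical): reading a zip archive yields a single partition, so only one executor does the work at read time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zip-check").getOrCreate()

# binaryFiles reads each archive as one (path, bytes) record; a single zip
# therefore lands in a single partition and cannot be read in parallel.
rdd = spark.sparkContext.binaryFiles("s3://my-bucket/data.zip")
print(rdd.getNumPartitions())  # typically 1
```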

The total number of tasks (200) suggests you are using SQL aggregations; 200 is the default value of spark.sql.shuffle.partitions, so this is likely the first point at which the data is actually repartitioned. Depending on the configuration, Spark may also prefer better data locality and therefore occupy fewer nodes.
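If the shuffle width matters for your case, it can be tuned; a sketch (the value 48 is just an example chosen to roughly match the ~48 available cores):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-shuffle")                      # illustrative name
         .config("spark.sql.shuffle.partitions", "48")  # default is 200; set it closer to the core count
         .getOrCreate())
```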

I would strongly advise unpacking the file before using it as input for Spark.
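A sketch of that workflow (assuming the archive holds plain CSV/text files; the paths are hypothetical): unpack first, then point Spark at the extracted, splittable files.

```python
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unpacked-input").getOrCreate()

# Extract the archive once (shown locally here; on EMR this could be a bootstrap
# or pre-processing step that writes the extracted files back to S3).
with zipfile.ZipFile("data.zip") as zf:
    zf.extractall("unpacked/")

# Plain text/CSV files are splittable, so the read is spread across executors.
df = spark.read.csv("unpacked/", header=True)
print(df.rdd.getNumPartitions())  # now > 1 for a 5 GB dataset
```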