
I have created a cluster in Google Cloud Dataproc with the snippet below:

gcloud dataproc clusters create $SOLO \
    --project $PROJ \
    --bucket $STORAGE \
    --region $REGION \
    --image-version 1.4-ubuntu18 --single-node \
    --master-machine-type n1-standard-8 \
    --master-boot-disk-type pd-ssd --master-boot-disk-size 100 \
    --initialization-actions gs://goog-dataproc-initialization-actions-$REGION/python/pip-install.sh

According to the Google documentation here, the n1-standard-8 machine type has 8 vCPUs.

I have a PySpark script, which contains the code below:

import pyspark
sc = pyspark.SparkContext.getOrCreate()
print(sc.defaultParallelism)

When I submit that PySpark script to the cluster, the job log shows that the Spark context's default parallelism is 2.

Why does sc.defaultParallelism return 2 instead of 8?
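For reference, the configuration the context actually starts with can be inspected from inside the job. This is only a small diagnostic sketch; the property keys are standard Spark configuration names, not values taken from the job log above:

import pyspark

sc = pyspark.SparkContext.getOrCreate()

# Inspect the configuration the context was actually created with.
# If spark.default.parallelism is not set explicitly, Spark falls back
# to a default that depends on the deploy mode and the registered cores.
conf = sc.getConf()
print("master:", sc.master)
print("spark.default.parallelism:", conf.get("spark.default.parallelism", "not set"))
print("spark.executor.cores:", conf.get("spark.executor.cores", "not set"))
print("sc.defaultParallelism:", sc.defaultParallelism)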


2 Answers


According to the Spark docs, this parameter is usually only meaningful for distributed shuffle operations. Even in that context, the value that gets used depends on the kind of operation (e.g. reduce, join, or parallelize), and it does not always equal the number of cores on the local machine.
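To illustrate that per-operation behaviour, here is a small PySpark sketch (not part of the original answer): parallelize() with no explicit partition count falls back to sc.defaultParallelism, while shuffle operations such as reduceByKey can take their own partition count.

import pyspark

sc = pyspark.SparkContext.getOrCreate()

# With no explicit partition count, parallelize() uses sc.defaultParallelism.
rdd = sc.parallelize(range(100))
print(rdd.getNumPartitions())  # matches sc.defaultParallelism, e.g. 2 here

# Shuffle operations such as reduceByKey accept their own numPartitions,
# so the effective parallelism is decided per operation, not by a single
# cluster-wide "number of cores" value.
pairs = sc.parallelize([(i % 4, i) for i in range(100)])
sums = pairs.reduceByKey(lambda a, b: a + b, numPartitions=16)
print(sums.getNumPartitions())  # 16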


Building on Henry's answer, and based on my limited knowledge of parallel computing: the n1-standard-8's 8 vCPUs are the maximum you can get, unless the job scheduler dynamically allocates (rather than increases) resources to your job from the pool of available resources. Some jobs need less than what was initially quoted and are accordingly allocated anywhere from more than 1 to fewer than 8 cores.
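If the goal is simply to have the job see all 8 cores, one option (a sketch, not something taken from either answer) is to set spark.default.parallelism explicitly before the context is created; the value 8 below is just the vCPU count quoted in the question:

import pyspark

# Set the parallelism explicitly so it no longer depends on how many
# executor cores happen to be registered when the context starts.
conf = pyspark.SparkConf().set("spark.default.parallelism", "8")
sc = pyspark.SparkContext.getOrCreate(conf=conf)
print(sc.defaultParallelism)  # 8

The same property could also be passed at submit time via the --properties flag of gcloud dataproc jobs submit pyspark, which avoids hard-coding it in the script.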