
I have created a cluster in Google Cloud Dataproc with the snippet below:

gcloud dataproc clusters create $SOLO \
    --project $PROJ \
    --bucket $STORAGE \
    --region $REGION \
    --image-version 1.4-ubuntu18 --single-node \
    --master-machine-type n1-standard-8 \
    --master-boot-disk-type pd-ssd --master-boot-disk-size 100 \
    --initialization-actions gs://goog-dataproc-initialization-actions-$REGION/python/pip-install.sh

According to the Google documentation here, the n1-standard-8 machine type has 8 vCPUs.

I have a PySpark script, which contains the code below:

import pyspark
sc = pyspark.SparkContext.getOrCreate()
print(sc.defaultParallelism)

When I submit that PySpark script to the cluster, the job log shows that the Spark context's default parallelism is 2.

Why does sc.defaultParallelism return 2 instead of 8?
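For reference, the configuration the context actually starts with can be inspected from inside the job. This is only a small diagnostic sketch; the property keys are standard Spark configuration names, not values taken from the job log above:

import pyspark

sc = pyspark.SparkContext.getOrCreate()

# Inspect the configuration the context was actually created with.
# If spark.default.parallelism is not set explicitly, Spark falls back
# to a default that depends on the deploy mode and the registered cores.
conf = sc.getConf()
print("master:", sc.master)
print("spark.default.parallelism:", conf.get("spark.default.parallelism", "not set"))
print("spark.executor.cores:", conf.get("spark.executor.cores", "not set"))
print("sc.defaultParallelism:", sc.defaultParallelism)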


2 Answers


According to the Spark docs, this parameter is usually only meaningful for distributed shuffle operations. Even in that context, the value that gets used depends on the kind of operation (e.g. reduce, join, or parallelize), and it does not always equal the number of cores on the local machine.
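To illustrate that per-operation behaviour, here is a small PySpark sketch (not part of the original answer): parallelize() with no explicit partition count falls back to sc.defaultParallelism, while shuffle operations such as reduceByKey can take their own partition count.

import pyspark

sc = pyspark.SparkContext.getOrCreate()

# With no explicit partition count, parallelize() uses sc.defaultParallelism.
rdd = sc.parallelize(range(100))
print(rdd.getNumPartitions())  # matches sc.defaultParallelism, e.g. 2 here

# Shuffle operations such as reduceByKey accept their own numPartitions,
# so the effective parallelism is decided per operation, not by a single
# cluster-wide "number of cores" value.
pairs = sc.parallelize([(i % 4, i) for i in range(100)])
sums = pairs.reduceByKey(lambda a, b: a + b, numPartitions=16)
print(sums.getNumPartitions())  # 16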


Building on Henry's answer, and based on my limited knowledge of parallel computing: the n1-standard-8's 8 vCPUs are the maximum you can get, unless the job scheduler dynamically allocates (rather than increases) resources to your job from the pool of available resources. Some jobs need less than what was initially quoted and are accordingly allocated anywhere from more than 1 to fewer than 8 cores.
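If the goal is simply to have the job see all 8 cores, one option (a sketch, not something taken from either answer) is to set spark.default.parallelism explicitly before the context is created; the value 8 below is just the vCPU count quoted in the question:

import pyspark

# Set the parallelism explicitly so it no longer depends on how many
# executor cores happen to be registered when the context starts.
conf = pyspark.SparkConf().set("spark.default.parallelism", "8")
sc = pyspark.SparkContext.getOrCreate(conf=conf)
print(sc.defaultParallelism)  # 8

The same property could also be passed at submit time via the --properties flag of gcloud dataproc jobs submit pyspark, which avoids hard-coding it in the script.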