I have created a cluster on Google Cloud Dataproc with the command below:
gcloud dataproc clusters create $SOLO \
--project $PROJ \
--bucket $STORAGE \
--region $REGION \
--image-version 1.4-ubuntu18 --single-node \
--master-machine-type n1-standard-8 \
--master-boot-disk-type pd-ssd --master-boot-disk-size 100 \
--initialization-actions gs://goog-dataproc-initialization-actions-$REGION/python/pip-install.sh
According to the Google documentation here, the n1-standard-8 machine type has 8 vCPUs.
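To double-check what the cluster actually got, describing it should show the master's machine type (reusing the same shell variables as above):

gcloud dataproc clusters describe $SOLO --region $REGION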
I have a PySpark script that contains the code below:
import pyspark

# Get the existing SparkContext, or create one if none is running
sc = pyspark.SparkContext.getOrCreate()
print(sc.defaultParallelism)
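For reference, I submit it with something along these lines (the script filename is just a placeholder):

gcloud dataproc jobs submit pyspark my_script.py \
--cluster $SOLO \
--region $REGION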
When I submit that PySpark script to the cluster, the job log shows that the Spark context's default parallelism is 2.
Why does sc.defaultParallelism return 2 and not 8?
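In case it's relevant, a quick way to inspect the related settings (these are standard Spark configuration properties; the "not set" fallbacks are just placeholders) would be something like:

import pyspark

sc = pyspark.SparkContext.getOrCreate()
conf = sc.getConf()

# defaultParallelism falls back to a scheduler-dependent value
# when spark.default.parallelism is not set explicitly
print(conf.get("spark.default.parallelism", "not set"))
print(conf.get("spark.executor.cores", "not set"))
print(conf.get("spark.executor.instances", "not set"))
print(conf.get("spark.master", "not set"))
print(sc.defaultParallelism)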