I am working on an application where users submit requests that are processed as Spark jobs. Currently we have a very large cluster in our data center serving the needs of the organization. We are planning to move to GCP, and to reduce costs we plan to switch to dynamic clustering. Since cluster sizing depends heavily on user activity, we are planning for a fully auto-scaling cluster.
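For reference, the auto-scaling behavior would be driven by a Dataproc autoscaling policy along the following lines. This is a minimal sketch; the instance bounds and timings are illustrative assumptions, not our actual values:

```yaml
# autoscaling_policy.yaml (sketch) -- bounds are illustrative, not production values
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
```

The policy would be registered with `gcloud dataproc autoscaling-policies import` and attached to the cluster at creation time via `--autoscaling-policy`.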
One problem is that our user requests are bound by an SLA, and request processing times are around 10 to 15 minutes. Unfortunately, with dynamic clustering the cluster takes another 5 to 6 minutes to come up, and adding worker nodes as part of auto-scaling also takes a long time.
Even though I have very few initialization steps, as a mitigation I have created a custom image with the library set required for my PySpark job pre-installed, and I use that image to start the cluster. For testing I am creating a very basic 2-node cluster, which still takes 4 to 6 minutes.
I am not even installing any additional "optional components".
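For context, the customization script baked into the image only pre-installs Python libraries, roughly like the sketch below (the package list here is illustrative; our actual set differs):

```shell
#!/bin/bash
# initialization_scripts_for_image.sh (sketch) -- runs once at image build time,
# so cluster startup does not need to pip-install anything.
set -euxo pipefail

# Illustrative package set; substitute the libraries your PySpark job needs.
pip install --upgrade pandas numpy pyarrow
```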
Here is the command I used for image creation:
python generate_custom_image.py \
--image-name custom-1-5-1-debina10 \
--family custom-image \
--dataproc-version 1.5.1-debian10 \
--customization-script initialization_scripts_for_image.sh \
--zone europe-west3-b \
--gcs-bucket gs://poc-data-store/custom-image-logs/ \
--disk-size 50 \
--dry-run
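For completeness, the test cluster is then created from that image roughly like this (the cluster name, project ID, and machine types are placeholders, not our actual values):

```shell
# Sketch: create the basic 2-node test cluster from the custom image.
# Cluster name, project ID, and machine types are placeholders.
gcloud dataproc clusters create test-cluster \
    --image=projects/my-project/global/images/custom-1-5-1-debina10 \
    --region=europe-west3 \
    --zone=europe-west3-b \
    --num-workers=2 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4
```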
Are there any suggestions for how I can improve the Dataproc cluster startup time? One observation: the Dataproc startup log shows that much of the time is spent uninstalling components:
Is there any way to push as much work as possible into the image-preparation phase, leaving only the starting of services for the cluster startup phase?