I am working on an application where users submit requests that are processed as Spark jobs. Currently we have a very large cluster in our data center serving the needs of the organization. We are planning to move to GCP, and to reduce costs we plan to switch to dynamic clustering. Since cluster sizing depends heavily on user activity, we are planning for a fully auto-scaling cluster.
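For reference, the auto-scaling behavior would be driven by a Dataproc autoscaling policy along the following lines. This is a minimal sketch; the instance bounds and timings are illustrative assumptions, not our actual values:

```yaml
# autoscaling_policy.yaml (sketch) -- bounds are illustrative, not production values
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
```

The policy would be registered with `gcloud dataproc autoscaling-policies import` and attached to the cluster at creation time via `--autoscaling-policy`.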
One problem is that our user requests are bound by an SLA, and request processing times are around 10 to 15 minutes. Unfortunately, with dynamic clustering the cluster takes another 5 to 6 minutes to come up, and adding worker nodes as part of auto-scaling also takes a long time.
Even though I have very few initialization steps, as a mitigation I have created a custom image with the library set required for my PySpark job pre-installed, and I use that image to start the cluster. For testing I am creating a very basic 2-node cluster, which still takes 4 to 6 minutes.
I am not even installing any additional "optional components".
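For context, the customization script baked into the image only pre-installs Python libraries, roughly like the sketch below (the package list here is illustrative; our actual set differs):

```shell
#!/bin/bash
# initialization_scripts_for_image.sh (sketch) -- runs once at image build time,
# so cluster startup does not need to pip-install anything.
set -euxo pipefail

# Illustrative package set; substitute the libraries your PySpark job needs.
pip install --upgrade pandas numpy pyarrow
```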
Here is the command I used for image creation:
python generate_custom_image.py \
--image-name custom-1-5-1-debina10 \
--family custom-image \
--dataproc-version 1.5.1-debian10 \
--customization-script initialization_scripts_for_image.sh \
--zone europe-west3-b \
--gcs-bucket gs://poc-data-store/custom-image-logs/ \
--disk-size 50 \
--dry-run
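For completeness, the test cluster is then created from that image roughly like this (the cluster name, project ID, and machine types are placeholders, not our actual values):

```shell
# Sketch: create the basic 2-node test cluster from the custom image.
# Cluster name, project ID, and machine types are placeholders.
gcloud dataproc clusters create test-cluster \
    --image=projects/my-project/global/images/custom-1-5-1-debina10 \
    --region=europe-west3 \
    --zone=europe-west3-b \
    --num-workers=2 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4
```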
Are there any suggestions for how I can improve the Dataproc cluster startup time? One observation: the Dataproc startup log shows that much of the time is spent uninstalling components:
Is there any way to push as much work as possible into the image-preparation phase, leaving only the starting of services for the cluster startup phase?