2
votes

I am currently using a Dataproc cluster with a fixed number of workers. The cluster uses a non-trivial initialization action to install some specific libraries on each worker.

Recently, we decided to try using some preemptible workers, but our Spark jobs are failing because some libraries are missing. The cause seems to be that the initialization actions are not run on the preemptible workers. In fact, I have connected to these workers over SSH and I am sure the initialization script was not executed on them: the expected libraries are not there, and the log our initialization script normally leaves behind is missing.

Is this a normal situation? How can I ensure that my preemptible workers have run my custom initialization action script?

1
Are you using an initialization action based on one from github.com/GoogleCloudPlatform/dataproc-initialization-actions or is it totally custom for your use case? - Dennis Huo
@DennisHuo it's a custom one. I've added a comment on the accepted response; the problem was that my script was failing on the preemptible worker. - YuppieNetworking

1 Answer

2
votes

This is definitely not normal. Dataproc should ensure the node does not join the cluster until it is fully initialized (along with other guarantees).

My best guess is that the repository could be flaky or overloaded, so the step that actually installs the library fails but the overall script still exits successfully. Could you try adding set -e at the top of your init action?
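As a sketch of what that could look like (the package name and the retry helper are placeholders, not part of your script), an init action with set -e and a small retry wrapper fails loudly on a flaky repository instead of leaving the node half-configured:

```shell
#!/bin/bash
# Sketch of a more defensive init action (the package name is a placeholder).
# With set -e, the script exits non-zero at the first failing command,
# so a failed install surfaces instead of being silently ignored.
set -e

# retry CMD...: run CMD up to 3 times, pausing between attempts,
# to tolerate a flaky or overloaded package repository.
retry() {
  local i
  for i in 1 2 3; do
    "$@" && return 0
    echo "attempt $i failed: $*" >&2
    sleep 1
  done
  return 1
}

# The real install step would go here, for example:
#   retry apt-get -y install <your-library>
retry echo "install step ran"
```

If all three attempts fail, retry returns 1 and, because of set -e, the whole init action exits non-zero, which Dataproc can then report as a node initialization failure.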

You can also SSH into the node and inspect the log of the init action in /var/log/dataproc-startup-script*.
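For example (the cluster node name and zone below are hypothetical; substitute your own), you can tail that log over SSH without logging in interactively:

```shell
# Hypothetical worker name and zone -- replace with your cluster's values.
gcloud compute ssh my-cluster-sw-0 --zone=us-central1-a \
  -- 'sudo tail -n 50 /var/log/dataproc-startup-script*'
```

Any error from the failing install step should appear near the end of that log.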