6
votes

I have one job that will take a long time to run on Dataproc. In the meantime I need to be able to run other, smaller jobs.

From what I could gather from the Google Dataproc documentation, the platform is supposed to support multiple jobs, since it uses YARN dynamic allocation for resources.

However, when I try to launch multiple jobs, they get queued and one doesn't start until the cluster is free.

All settings are at their defaults. How can I enable multiple jobs to run at the same time?


1 Answer

6
votes

Dataproc indeed supports multiple concurrent jobs. However, its ability to host multiple jobs depends on YARN having capacity available to host each job's Application Master (otherwise the job is queued) as well as its actual workers (otherwise the job takes a long time).
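To confirm that queuing is what you're seeing, you can SSH into the cluster's master node and ask YARN directly. A minimal sketch, assuming the usual Dataproc "-m" master suffix; the cluster name and zone are placeholders. An application stuck in the ACCEPTED state is waiting for YARN to allocate its Application Master:

    # SSH to the Dataproc master node (name and zone are placeholders)
    gcloud compute ssh my-cluster-m --zone us-central1-a

    # List running and queued YARN applications; ACCEPTED means the job
    # is still waiting for YARN to allocate its Application Master
    yarn application -list -appStates RUNNING,ACCEPTED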

The number of containers that your larger job requests depends on its number of partitions. With default settings, a Dataproc worker supports 2 mapper or reducer tasks. If you're processing 100 files and each file is a partition, that one job needs 100 task slots, which at 2 tasks per worker fully allocates a 50-worker cluster.
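One way to keep headroom for smaller jobs is to cap how many executors the long-running job may claim. A hedged sketch for a Spark job; the cluster name, main class, jar path, and the cap value are all placeholders you'd adapt to your cluster size:

    # Cap the long job's executors so it can't claim the whole cluster
    gcloud dataproc jobs submit spark \
      --cluster my-cluster \
      --class com.example.BigJob \
      --jars gs://my-bucket/big-job.jar \
      --properties spark.dynamicAllocation.maxExecutors=8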

There are a few things you could do:

  • Run smaller jobs on a separate cluster. Your ideal cluster configuration is one job occupying the entire cluster, or N jobs evenly sharing the cluster (see the commands sketched after this list)

  • Add extra workers to your current cluster and/or experiment with preemptible workers; you can resize with the clusters update command (see the sketch after this list)

  • (Advanced) Experiment with different YARN schedulers, for example the Fair Scheduler with queues (see the sketch after this list)
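
For the first two options, the gcloud commands look roughly like this (cluster names and worker counts are placeholders):

    # Spin up a small second cluster for the short jobs
    gcloud dataproc clusters create small-cluster --num-workers 2

    # Or grow the existing cluster, optionally with preemptible workers
    gcloud dataproc clusters update my-cluster \
      --num-workers 10 \
      --num-preemptible-workers 4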
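For the scheduler option, Dataproc lets you set yarn-site.xml values at cluster creation via the --properties flag with the yarn: prefix. A sketch that switches the ResourceManager to the Fair Scheduler (the cluster name is a placeholder, and any queue definitions would still need their own fair-scheduler.xml):

    # Create a cluster whose ResourceManager uses the Fair Scheduler
    gcloud dataproc clusters create fair-cluster \
      --properties 'yarn:yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'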