2
votes

I want to schedule the training of my model on Dataproc. I need to:

1/ Query BigQuery and load the results into the BQ table for my dataset
2/ Create my Dataproc cluster
3/ Launch my PySpark job
4/ Delete my cluster

I'd like to create a cron job for this. How can I do that?

Thanks

1
When do you need to run the cron job from? Are there any obstacles to running an ordinary (e.g., Linux) cron job? - reprogrammer
First of all, every six hours. I could use cron on a Linux VM, which is my current solution, but I was wondering whether there is a solution without a VM. (With a GAE cron job I am limited, because the job takes too long.) - deefree
Are you facing any limitations in using the GAE cron service cloud.google.com/solutions/… ? - reprogrammer

1 Answer

2
votes

There are a few different options here for you, depending how complicated or in-depth you want to get.

Very simple use

If you want to run things in as simple a way as possible, you could probably get away with creating a shell script which simply invokes the Cloud SDK commands, such as:

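#!/bin/sh
# Create the cluster and capture the command's output.
CREATION_OUTPUT=$(gcloud dataproc clusters create ...)
...
# Delete the cluster and capture the command's output.
DELETION_OUTPUT=$(gcloud dataproc clusters delete ...)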

If you want to schedule it, the easiest way (IMHO, everyone may have a different opinion) would be to let it run on an f1-micro instance. The total cost for that would probably be about $5/month.
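For the "every six hours" schedule mentioned in the comments, a crontab entry on that instance could look roughly like the sketch below. The script and log paths are placeholders I made up, not anything from your setup, so adjust them to wherever your script lives.

```sh
#!/bin/sh
# Hypothetical locations; replace with your own script and log file.
SCRIPT_PATH="/path/to/train_on_dataproc.sh"
LOG_PATH="/path/to/train_on_dataproc.log"

# Minute 0 of every 6th hour (00:00, 06:00, 12:00, 18:00),
# with stdout and stderr appended to the log file.
CRON_LINE="0 */6 * * * $SCRIPT_PATH >> $LOG_PATH 2>&1"

# Print the entry. To install it for the current user you could run:
#   (crontab -l 2>/dev/null; echo "$CRON_LINE") | crontab -
echo "$CRON_LINE"
```

Redirecting the output to a log file at least gives you something to look at when a run goes wrong.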

More advanced

You might want to use shell + variables so you do not need to hard-code everything. For example, you can create clusters with a unique ID based on the time or some other value. This may be useful for you, especially if you want to create clusters often.
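As a rough sketch of the time-based ID idea, something like this would build a cluster name from a prefix and a UTC timestamp. The helper name make_cluster_name and the "train" prefix are just illustrative choices of mine. As far as I know, cluster names should stick to lowercase letters, digits, and hyphens, which this format does.

```sh
#!/bin/sh
# Build a name of the form <prefix>-YYYYMMDD-HHMMSS using the current UTC time.
# make_cluster_name and the "train" prefix are illustrative, not required names.
make_cluster_name() {
    prefix=$1
    printf '%s-%s\n' "$prefix" "$(date -u +%Y%m%d-%H%M%S)"
}

CLUSTER_NAME=$(make_cluster_name train)

# The same variable can then be passed to the create, submit, and delete commands.
echo "Cluster name for this run: $CLUSTER_NAME"
```

With a six-hour schedule a timestamp down to the second is more than enough to keep names from colliding between runs.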

But...

Both of these approaches are far from safe. If you get an error, for instance, your entire setup may enter and remain in a bad state. You will probably not know something has happened nor will you capture detailed debugging information.
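If you do stay with a shell script, you can at least reduce the risk of leaving a cluster running after a failed job. Here is a minimal sketch of that pattern. The three step functions are stand-ins that only print a message (with run_job deliberately failing to simulate an error); in a real script each would wrap the corresponding gcloud command.

```sh
#!/bin/sh
# Stand-in steps; replace the bodies with the real gcloud calls.
create_cluster() { echo "create cluster"; }
run_job()        { echo "run job"; return 1; }   # simulates a failing job
delete_cluster() { echo "delete cluster"; }

# Run the steps in order and always attempt the cleanup once the
# cluster has been created, remembering the last failure seen.
run_pipeline() {
    status=0
    if create_cluster; then
        run_job || status=$?
        # Cleanup runs whether or not the job succeeded.
        delete_cluster || status=$?
    else
        # Cluster creation failed, so there is nothing to submit.
        status=$?
    fi
    return "$status"
}

if run_pipeline; then
    echo "pipeline succeeded"
else
    echo "pipeline failed with status $?"
fi
```

This still does not give you notifications or detailed debugging output, but a non-zero exit status is something cron or a wrapper script can act on.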

A better solution

Using the APIs will take a bit more work but is probably a better overall solution, especially if you want to run this repeatedly or have error handling. In this case, I'd probably use Python to quickly write a script to talk to the APIs. This would let me capture errors, handle them, and probably recover (and notify, if you want.)

Here are some examples of our APIs being used with Python:

Directions for these are here, respectively: