1 vote

I have a small daily computing job that imports data from BigQuery, processes it with Python numerical computing libraries (pandas, numpy), and then writes the results to an external table (Firestore or MySQL in another project).
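For context, a minimal sketch of such a job might look like the following. The table, collection, and column names are illustrative, and the GCP clients assume default application credentials; only the pure `transform` step reflects the pandas processing:

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the real pandas/numpy processing:
    # here, a simple per-category aggregation.
    return df.groupby("category", as_index=False)["amount"].sum()

def run_job():
    # GCP clients imported lazily so transform() stays testable
    # without google-cloud dependencies installed.
    from google.cloud import bigquery, firestore

    bq = bigquery.Client()
    df = bq.query(
        "SELECT category, amount FROM `my-project.my_dataset.my_table`"
    ).to_dataframe()

    result = transform(df)

    # Write each result row as a Firestore document
    # (a MySQL INSERT would be the alternative sink).
    fs = firestore.Client()
    for record in result.to_dict("records"):
        fs.collection("daily_results").add(record)
```

Scheduling-wise, `run_job()` is what the container entrypoint (or whichever service ends up hosting the job) would invoke once per day.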

What is the recommended way to deploy it on GCP?

Our DevOps team advises us against creating a single VM just for a batch job. They would prefer not to manage VM infrastructure themselves, and they believe there must be services built specifically for batch jobs. They insist that I use Dataflow, but I think Dataflow's distributed nature is a bit of an overkill.

Many thanks,


Updated October 14, 2019:

I'm thinking about dockerizing the batch job and deploying it to a Kubernetes cluster. The downside is that the cluster would need to host several jobs to be worth the setup and maintenance effort. Can someone give me advice on the feasibility and suitability of this approach?


Updated October 15, 2019:

Thanks to Alex Titov for his comment at https://googlecloud-community.slack.com/archives/C0G6VB4UE/p1571032864020000. Based on his suggestion, I'm going to break my job into multiple small Cloud Functions components and chain them together into a pipeline with Cloud Scheduler and/or Cloud Composer.
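One way this chaining is commonly sketched: Cloud Scheduler publishes to a Pub/Sub topic, each Pub/Sub-triggered Cloud Function does one small step, then publishes to the next stage's topic. All function, project, topic, and payload field names below are hypothetical:

```python
import base64
import json

def decode_message(event: dict) -> dict:
    # Cloud Scheduler and Pub/Sub deliver the payload base64-encoded
    # under the "data" key of a background-function event.
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))

def process_stage(event, context):
    # Entry point for one Pub/Sub-triggered Cloud Function in the chain
    # (stage, project, and topic names here are illustrative).
    params = decode_message(event)

    # ... do this stage's small piece of work with params ...

    # Hand off to the next stage by publishing to its trigger topic.
    from google.cloud import pubsub_v1  # needs google-cloud-pubsub
    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path(params["project"], "next-stage-topic")
    publisher.publish(topic, json.dumps({"date": params["date"]}).encode("utf-8"))
```

Keeping each function to one step also gives a natural retry boundary: a failed stage can be retried from its trigger message without re-running the whole pipeline.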

What are your requirements? Constraints? How much memory do you need? How long does a run take? What's the retry policy? There are a lot of solutions, and with these details the right one can be recommended! - guillaume blaquiere
thanks @guillaumeblaquiere. The batch job fits comfortably on a single VM with Python and its data processing and machine learning libraries installed. A run can take up to 1-2 hours. Retries can be handled in application logic, while a failure to spawn the VM should send an alert to the job owner. - Quy Dinh

2 Answers

2 votes

Cloud Dataflow does exactly what you are looking for, and it's much easier to manage, scale, and build on than a VM. Ask yourself a few questions beforehand, and if none of them rule it out, use Dataflow:

  • Do I want to avoid being tied to a specific cloud provider (GCP in this case)?
  • Does this project use other cloud services, or only cloud infrastructure (keeping consistency)? Also, in what direction do we want the project to go (custom solutions or managed cloud services)?
  • Do I want absolute control over this batch processing tool? If so, you may not have it with Dataflow.
  • Other considerations, like cost, deployment time, and ramp-up time.

If all the answers incline towards a managed cloud service, then use Dataflow.

0 votes

If you containerize your job, there are two serverless solutions for running it. Some day a third will be available, when Cloud Run can run for longer than 15 minutes (it's on the roadmap, but without a release date).

  1. Use Cloud Build. Remember to set the timeout correctly. In fact, Cloud Build is designed to run any container. I wrote an article on this.

  2. Use AI Platform. A (great) Googler has published an article on this.

Both solutions are great, and you can choose the machine type of the underlying VM that runs your container. Thanks to this, you don't have to manage a K8s cluster or pay for it when it isn't used.