
I'm using a Google Cloud Function to run an ETL job:

  1. Get data from a JSON API
  2. Enrich every row of that data with another API
  3. Write to Cloud Storage

A Cloud Scheduler cron job runs every night to trigger the cloud function. I can also run the pipeline manually to query for a specific date. The cloud function is written in Python and roughly follows the shape sketched below.
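
For context, a minimal sketch of the current (sequential) shape of the function; the URLs, bucket name, and field names here are placeholders, not my real configuration:

```python
# Rough sketch of the current pipeline; all names below are placeholders.
import json
import requests
from google.cloud import storage

SOURCE_API = "https://example.com/api/data"    # placeholder
ENRICH_API = "https://example.com/api/enrich"  # placeholder
BUCKET = "my-etl-bucket"                       # placeholder

def run_etl(date: str):
    # 1. Get data from the JSON API
    rows = requests.get(SOURCE_API, params={"date": date}, timeout=30).json()

    # 2. Enrich every row with another API (one call per row)
    for row in rows:
        row["enrichment"] = requests.get(
            ENRICH_API, params={"id": row["id"]}, timeout=30
        ).json()

    # 3. Write the result to Cloud Storage
    blob = storage.Client().bucket(BUCKET).blob(f"etl/{date}.json")
    blob.upload_from_string(json.dumps(rows), content_type="application/json")
```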

The job always ran close to 9 minutes, but it worked fine for a couple of months. Unfortunately I'm now hitting the 9-minute hard limit of Google Cloud Functions, and I'm wondering what my best options would be:

  1. Set up a compute engine
  2. Set up an app engine
  3. Work on the cloud function to parallelize it and save time (e.g. along the lines of the sketch below)
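
For option 3, the per-row enrichment calls are I/O bound, so I imagine something like a thread pool could cut the runtime. A rough sketch, where `enrich_row` is a hypothetical wrapper around the second API call:

```python
# Sketch of option 3: parallelize the per-row enrichment calls.
# enrich_row() is a hypothetical wrapper around the enrichment API call.
from concurrent.futures import ThreadPoolExecutor

def enrich_all(rows, max_workers=20):
    # The calls are I/O bound, so threads should be enough; no extra CPU needed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        enrichments = list(pool.map(enrich_row, rows))
    for row, enrichment in zip(rows, enrichments):
        row["enrichment"] = enrichment
    return rows
```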

Are there any better options? What GCP service would be well suited for this task? Do you have any best practices? I really like the simplicity of Cloud Functions, but that simplicity comes with trade-offs of course...


1 Answer


I recommend you use Cloud Run.

  • The timeout is currently 15 minutes, and will soon be 4 times more! That will be enough for your processing.
  • If your code can leverage several CPUs, you can have 2 CPUs with Cloud Run.
  • In addition, if it's possible to run several processings at the same time, Cloud Run can handle up to 80 concurrent requests on the same instance, while a Cloud Function handles only one. If you perform heavy computation on your instance, it's better to avoid concurrency: set the --concurrency param to 1 to get exactly the same behavior as with Cloud Functions.

I wrote an article where I wrap a simple function into a Cloud Run service. A few lines of code, an additional import (Flask), and that's all! Add a standard Dockerfile for Python and deploy!
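
Roughly, the wrapping looks like this (the names are only illustrative; `run_etl` stands in for your current Cloud Function entry point):

```python
# Minimal Flask wrapper to run the existing function on Cloud Run.
# run_etl() is a placeholder for your current Cloud Function entry point.
import os
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handler():
    # Cloud Scheduler (or a manual call) can POST the date to process.
    payload = request.get_json(silent=True) or {}
    run_etl(payload.get("date"))
    return "OK", 200

if __name__ == "__main__":
    # Cloud Run provides the port to listen on via the PORT env variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```

You then deploy with gcloud run deploy, which is also where you set --timeout and --concurrency (for example --concurrency 1 to keep the one-request-per-instance behavior of Cloud Functions).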

With the new Buildpacks feature, you can even avoid creating a Dockerfile! Buildpacks are installed in Cloud Shell, and if you use Cloud Build, I have a working example if you want (let me know!)