7
votes

I am reading 10 million records from BigQuery, doing some transformations, and creating a .csv file, and I am uploading the same .csv data as a stream to an SFTP server using Node.js.

This job takes approximately 5 to 6 hours to complete locally.

The solution has been deployed on GCP Cloud Run, but after 2 to 3 seconds Cloud Run closes the container with a 503 error.

Please find below the configuration of GCP Cloud Run:

  • Autoscaling: up to 1 container instance
  • CPU allocated: default
  • Memory allocated: 2Gi
  • Concurrency: 10
  • Request timeout: 900 seconds

Is GCP Cloud Run a good option for a long-running background process?

5
You're using the wrong tool. Cloud Run is not a good fit for this. Try Cloud Dataflow instead. – Graham Polley
Is it possible to upload a file in the Cloud Dataflow steps? @graham-polley – mayur nimavat
Upload the file first to Cloud Storage. Cloud Dataflow reads files from Cloud Storage. – John Hanley
Do you want to keep your container? – guillaume blaquiere
@guillaumeblaquiere, yes, I want to keep the container alive for a long period of time to process the request in the background. – mayur nimavat

5 Answers

2
votes

You can try using an Apache Beam pipeline deployed via Cloud Dataflow. Using Python, you can perform the task with the following steps:

Stage 1. Read the data from the BigQuery table.

beam.io.Read(beam.io.BigQuerySource(query=your_query, use_standard_sql=True))

Stage 2. Write the Stage 1 result to a CSV file in a GCS bucket.

beam.io.WriteToText(file_path_prefix="",
                    file_name_suffix='.csv',
                    header='list of csv file headers')

Stage 3. Call a ParDo function which takes the CSV file created in Stage 2 and uploads it to the SFTP server. You can refer to this link.
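Putting the three stages together, here is a minimal sketch of what the pipeline could look like. The query, column list, bucket path, and SFTP host/credentials are placeholders, and for brevity the Stage 3 upload is done with paramiko after the pipeline finishes rather than inside a ParDo:

# A sketch only: placeholder values everywhere, error handling omitted.
import apache_beam as beam
import paramiko
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions

QUERY = "SELECT * FROM `your_project.your_dataset.your_table`"   # placeholder
COLUMNS = ["id", "name", "amount"]                                # placeholder
OUTPUT = "gs://your-bucket/exports/records"                       # placeholder

# Pass --runner=DataflowRunner, --project, --temp_location, etc. on the command line.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (p
     # Stage 1: read rows (as dicts) from BigQuery.
     | "ReadFromBQ" >> beam.io.Read(
           beam.io.BigQuerySource(query=QUERY, use_standard_sql=True))
     # Turn each row dict into one CSV line.
     | "ToCsvLine" >> beam.Map(lambda row: ",".join(str(row[c]) for c in COLUMNS))
     # Stage 2: write a single CSV file to the GCS bucket.
     | "WriteCsv" >> beam.io.WriteToText(
           file_path_prefix=OUTPUT,
           file_name_suffix=".csv",
           header=",".join(COLUMNS),
           num_shards=1,
           shard_name_template=""))

# Stage 3: stream the finished file from GCS to the SFTP server.
transport = paramiko.Transport(("sftp.example.com", 22))     # placeholder host
transport.connect(username="user", password="password")      # placeholder credentials
sftp = paramiko.SFTPClient.from_transport(transport)
with FileSystems.open(OUTPUT + ".csv") as csv_file:
    sftp.putfo(csv_file, "/upload/records.csv")               # placeholder remote path
sftp.close()
transport.close()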

4
votes

You can use a VM instance with your container deployed on it and perform your job there. At the end, kill or stop the VM.

But, personally, I prefer a serverless solution and approach, like Cloud Run. However, long-running jobs on Cloud Run will come one day! Until then, you have to deal with the limit of 60 minutes or use another service.

As a workaround, I propose that you use Cloud Build. Yes, Cloud Build for running any container in it. I wrote an article on this: I ran a Terraform container on Cloud Build, but in reality you can run any container.

Set the timeout correctly, take care of the default service account and its assigned roles, and, something not yet available on Cloud Run, choose the number of CPUs (1, 8 or 32) for the processing to speed up your job.
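As an illustration only, a cloudbuild.yaml along these lines might look like the sketch below; the image name and values are placeholders, not a tested configuration:

steps:
  - name: 'gcr.io/$PROJECT_ID/bq-to-sftp'   # placeholder: your long-running container image
timeout: '21600s'                           # 6 hours, well beyond the Cloud Run request timeout
options:
  machineType: 'N1_HIGHCPU_8'               # more vCPUs to speed up the processing

You can then start it with gcloud builds submit or with a build trigger.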

Want a bonus? You have 120 minutes free per day and per billing account (be careful, it's not per project!)

2
votes

Is GCP Cloud Run a good option for a long-running background process?

Not a good option, because your container is 'brought to life' by an incoming HTTP request, and as soon as the container responds (e.g. sends something back), Google assumes the processing of the request is finished and cuts off the CPU.

Which may explain this:

The solution has been deployed on GCP Cloud Run, but after 2 to 3 seconds Cloud Run closes the container with a 503 error.

1
vote

You may consider a serverless, event-driven approach:

  • configure a Google Cloud Storage trigger for a Cloud Function that runs the transformation
  • extract/export the BigQuery data to the Cloud Function's trigger bucket - this is the fastest way to get BigQuery data out

Data exported that way may sometimes be too large to be suitable in that form for Cloud Function processing, due to restrictions like the maximum execution time (currently 9 minutes) or the 2 GB memory limit. In that case, you can split the original data file into smaller pieces and/or push them to Pub/Sub with a storage mirror.
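As a rough sketch of those two pieces (the bucket, dataset, and table names are placeholders, and the transformation itself is left as a stub):

from google.cloud import bigquery


def export_table_to_gcs():
    """Export the BigQuery table as CSV into the bucket that triggers the
    Cloud Function. The wildcard lets BigQuery shard the export into pieces
    small enough for a single function invocation."""
    client = bigquery.Client()
    destination = "gs://your-trigger-bucket/exports/records-*.csv"  # placeholder
    job = client.extract_table(
        "your_project.your_dataset.your_table",  # placeholder table
        destination,
    )
    job.result()  # wait for the export to finish


def process_export(event, context):
    """Background Cloud Function deployed with a google.storage.object.finalize
    trigger on the export bucket; it runs once per exported shard."""
    print("Transforming gs://{}/{}".format(event["bucket"], event["name"]))
    # ... run the transformation here and push the result onward
    #     (e.g. to Pub/Sub or to the SFTP server) ...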

All that said, we've used Cloud Functions to process a billion records, from building Bloom filters to publishing data to Aerospike, in under a few minutes end to end.

0
votes

I will try to use Dataflow to create the .csv file from BigQuery and will upload that file to GCS.