
Currently, my data goes through the following steps:

New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery load job to load the data into BigQuery.
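
For reference, a simplified sketch of that function (the Python google-cloud-bigquery client is assumed, and the project/dataset/table naming is only illustrative):

# main.py -- GCS-triggered Cloud Function (illustrative sketch)
from google.cloud import bigquery

def load_to_bigquery(event, context):
    """Triggered by a new object in the GCS bucket; starts a BigQuery load job."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # assumption: CSV input files
        autodetect=True,
    )

    # Table name derived from the file name -- illustrative only.
    table_id = "my-project.my_dataset." + event["name"].split(".")[0]

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    print(f"Started load job {load_job.job_id} for {uri}")
    # The function returns here; it does NOT wait for the job to finish.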

I need a low-cost way to know when this BigQuery job has finished, so that I can trigger a Dataflow pipeline only after the job is complete.

Notes:

  • I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea: from what I saw, this trigger uses the job ID, which apparently cannot be fixed in advance, so I would have to redeploy the function for every job. And of course it's an alpha feature.
  • I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log entry that indicates the job has finished.
  • My files are large, so it isn't a good idea to use a Google Cloud Function that just waits for the job to finish.
Actually, I noticed that I need the "Private Logs Viewer" role to see the jobcompleted entry in Logging, so now I'm leaning toward the Stackdriver Logging solution. – Samuel Neves
Do you want to run a Dataflow pipeline for each loaded file, or only after all your files have been loaded? – guillaume blaquiere
What I would do, even if it's not perfect, is create a Cloud Function that runs periodically and checks the job ID via the BigQuery API to see whether it has completed. If so, run the Dataflow pipeline with the Dataflow SDK (see the sketch after these comments). You can use Datastore as a queue to track your job IDs and Cloud Scheduler to create the cron job. – Pievis
Have you considered doing all of the steps in one Dataflow pipeline? Then it is pretty easy to run things in sequential steps. – Kenn Knowles
@guillaumeblaquiere My objective is to run Dataflow for each table generated from a file (with a custom query). – Samuel Neves
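
A rough sketch of the polling approach Pievis describes, assuming the load job IDs were recorded somewhere (e.g. Datastore) when the jobs were created; launch_dataflow_pipeline is a hypothetical helper, not an existing API:

# Periodic check, run e.g. from a Cloud Function on a Cloud Scheduler cron.
from google.cloud import bigquery

def check_pending_jobs(job_ids):
    client = bigquery.Client()
    for job_id in job_ids:
        job = client.get_job(job_id)          # look the job up via the BigQuery API
        if job.state != "DONE":
            print(f"Job {job_id} is still {job.state}")
        elif job.error_result is not None:
            print(f"Job {job_id} failed: {job.error_result}")
        else:
            launch_dataflow_pipeline(job)     # hypothetical helper: start the Dataflow job here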

2 Answers

1 vote

You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. In Composer you define a DAG; Airflow executes each node of the DAG and checks its dependencies, so that tasks run either in parallel or sequentially based on the conditions you have defined.

You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
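
As a rough illustration, a DAG for this flow could look like the sketch below (operator names and parameters are from the Airflow 1.10 contrib modules; all bucket, table and template names are placeholders):

# Sketch of a Composer (Airflow) DAG: load from GCS into BigQuery, then run Dataflow.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

with DAG("gcs_to_bq_to_dataflow",
         schedule_interval=None,              # trigger externally, e.g. from a Cloud Function
         start_date=datetime(2019, 1, 1)) as dag:

    load_to_bq = GoogleCloudStorageToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-bucket",                                        # placeholder
        source_objects=["incoming/*.csv"],                         # placeholder
        destination_project_dataset_table="my_dataset.my_table",   # placeholder
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )

    run_dataflow = DataflowTemplateOperator(
        task_id="run_dataflow",
        template="gs://my-bucket/templates/my_template",           # placeholder
        dataflow_default_options={"project": "my-project",
                                  "tempLocation": "gs://my-bucket/tmp"},
    )

    # The Dataflow task only starts once the BigQuery load task has succeeded.
    load_to_bq >> run_dataflow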

1 vote

Despite what you mention about Stackdriver Logging, you can use it with this filter:

resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"

You can also add a dataset filter if needed.

Then create a sink to Pub/Sub on this advanced filter, trigger a Cloud Function from that topic, and launch your Dataflow job from the function, as sketched below.
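
A minimal sketch of that function, triggered by the Pub/Sub topic behind the logging sink (project, region and template path are placeholders; it assumes a Dataflow template and the google-api-python-client package):

# Cloud Function triggered by the Pub/Sub topic that receives the exported log entries.
import base64
import json

from googleapiclient.discovery import build

PROJECT = "my-project"                                   # placeholder
REGION = "us-central1"                                   # placeholder
TEMPLATE_PATH = "gs://my-bucket/templates/my_template"   # placeholder

def on_job_completed(event, context):
    # The sink delivers the LogEntry as JSON in the Pub/Sub message data.
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    job = entry["protoPayload"]["serviceData"]["jobCompletedEvent"]["job"]

    # Skip jobs that completed with errors.
    if job["jobStatus"].get("error"):
        print("BigQuery job finished with an error; not starting Dataflow.")
        return

    # Launch a Dataflow job from a template via the Dataflow REST API.
    # Note: job names must be unique among currently running Dataflow jobs.
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE_PATH,
        body={"jobName": "after-bq-load", "parameters": {}},
    ).execute()
    print("Launched Dataflow job:", response["job"]["id"])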

If this doesn't match your expectations, can you explain why in more detail?