
I'm new to Google Dataflow. I have two Dataflow pipelines that execute two different jobs: one is an ETL process that loads data into BigQuery, and the other reads from BigQuery and aggregates the data for a report. I want to run the ETL pipeline first, and only after it completes should the report pipeline run, so that the report is built on the latest data in BigQuery.

I tried to run both in one pipeline, but it didn't work. For now I have to run the ETL manually first and then run the report pipeline.

Can anybody give me some advice on running the two jobs in one pipeline? Thanks.

The solution I found: I built the ETL process in one pipeline and the aggregation process in another pipeline, exported each pipeline to a runnable jar file, and used a shell script to schedule the batch jobs daily, with the aggregation job depending on the status of the ETL process. – lknguyen

1 Answer


You should be able to do both of these in a single pipeline. Rather than writing to BigQuery and then trying to read that back in and generate the report, consider just using the intermediate data for both purposes. For example:

PCollection<Input> input = /* ... */;
// Perform your transformation logic
PCollection<Intermediate> intermediate = input
  .apply(...)
  .apply(...);
// Convert the transformed results into table rows and
// write those to BigQuery.
intermediate
  .apply(ParDo.of(new IntermediateToTableRowETL()))
  .apply(BigQueryIO.write(...));
// Generate your report over the transformed data
intermediate
  .apply(...)
  .apply(...);
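
For reference, here is a more self-contained sketch of the same single-pipeline idea, written against the Apache Beam Java SDK. The input path, table name, transformation logic, and report aggregation below are illustrative placeholders, not anything taken from the question:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class EtlAndReportPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read raw input (the path is a placeholder).
    PCollection<String> input =
        p.apply("ReadInput", TextIO.read().from("gs://my-bucket/input/*.csv"));

    // Shared transformation: both branches below reuse this PCollection,
    // so the report is computed over exactly the data being loaded.
    PCollection<String> intermediate = input.apply(
        "CleanRecords",
        ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(c.element().trim()); // placeholder ETL logic
          }
        }));

    // Branch 1: convert to TableRows and load into BigQuery
    // (assumes the target table already exists).
    intermediate
        .apply("ToTableRow", ParDo.of(new DoFn<String, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(new TableRow().set("value", c.element()));
          }
        }))
        .apply("LoadToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    // Branch 2: aggregate the same intermediate data for the report.
    intermediate
        .apply("CountRecords", Count.<String>globally())
        .apply("FormatReport", ParDo.of(new DoFn<Long, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output("total records loaded: " + c.element());
          }
        }))
        .apply("WriteReport", TextIO.write().to("gs://my-bucket/report/daily"));

    p.run();
  }
}

Because both branches consume the same intermediate PCollection, the report always reflects exactly the data that was loaded into BigQuery in the same run, so there is no need to wait for one job to finish before starting the other.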