I'd like to test my pipeline. It extracts data from BigQuery, then stores the data in GCS and S3. There is some information about pipeline testing at https://cloud.google.com/dataflow/pipelines/testing-your-pipeline, but it does not cover the data model for data extracted from BigQuery.

I found the following example, but it has few comments, so it is a little difficult to understand: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/test/java/com/google/cloud/dataflow/examples/cookbook/BigQueryTornadoesTest.java

Is there any good documentation on testing my pipeline?

1 Answer

To properly integration-test your entire pipeline, create a small amount of sample data in BigQuery, and create a sample bucket/folder in both S3 and GCS to hold your output. Then run your pipeline as you normally would, using PipelineOptions to specify the test BQ table. You can use the DirectPipelineRunner if you want to run locally. It will probably be easiest to write a script that first runs the pipeline, then downloads the data from S3 and GCS and verifies that it matches what you expect.
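For the verification step of such a script, here is a minimal sketch using only the Java standard library. It assumes the output has already been downloaded into a local directory as line-oriented text shards; the class name `OutputVerifier` and its methods are hypothetical, not part of the Dataflow SDK. Lines are compared as a multiset because the order of records across output shards is generally not something you can rely on.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of any SDK.
final class OutputVerifier {

    /** Reads every regular file directly under dir and returns all of their lines. */
    static List<String> readShards(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            lines.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
        return lines;
    }

    /**
     * Compares two collections of lines as multisets (order is ignored,
     * duplicates count). Returns one message per mismatching line; an
     * empty list means the output matches the expectation.
     */
    static List<String> diff(List<String> expected, List<String> actual) {
        // Positive count: expected more often than seen. Negative: seen more often than expected.
        Map<String, Integer> delta = new TreeMap<>();
        for (String line : expected) {
            delta.merge(line, 1, Integer::sum);
        }
        for (String line : actual) {
            delta.merge(line, -1, Integer::sum);
        }
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : delta.entrySet()) {
            int count = entry.getValue();
            if (count > 0) {
                problems.add("missing x" + count + ": " + entry.getKey());
            } else if (count < 0) {
                problems.add("unexpected x" + (-count) + ": " + entry.getKey());
            }
        }
        return problems;
    }
}
```

You would call `readShards` once for the directory downloaded from GCS and once for the one downloaded from S3, then fail the script if `diff` returns anything for either of them.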

If you just want to test your pipeline's transforms on some offline data, follow the WordCount example.
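On the BigQuery data model: a `TableRow` read from BigQuery behaves like a map from column name to value, so offline test rows can be built in memory. The sketch below is only an illustration of that idea and deliberately avoids the SDK: it uses a plain `Map<String, Object>` as a stand-in for `TableRow`, and the class `TornadoRows`, its methods, and the `month`/`tornado` columns are assumptions loosely modeled on the tornado example linked in the question. In a real test you would put the per-row logic in a DoFn and test it the way the WordCount example does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in logic; a plain Map plays the role of a BigQuery TableRow.
final class TornadoRows {

    /** Builds one in-memory row with the two columns this sketch reads. */
    static Map<String, Object> row(String month, boolean tornado) {
        Map<String, Object> row = new HashMap<>();
        row.put("month", month);      // assumed to arrive as a string, e.g. "6"
        row.put("tornado", tornado);  // assumed to arrive as a boolean
        return row;
    }

    /** Per-row step: emit the month of every row whose tornado flag is true. */
    static List<Integer> extractTornadoMonths(List<Map<String, Object>> rows) {
        List<Integer> months = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if (Boolean.TRUE.equals(row.get("tornado"))) {
                months.add(Integer.parseInt((String) row.get("month")));
            }
        }
        return months;
    }

    /** Aggregation step: count how many times each month was emitted. */
    static Map<Integer, Long> countPerMonth(List<Integer> months) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (Integer month : months) {
            counts.merge(month, 1L, Long::sum);
        }
        return counts;
    }
}
```

Keeping the row-level logic in small functions like these means the offline test only needs a handful of hand-written rows and never touches BigQuery, GCS, or S3.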