0
votes

Basicly we want to split a big (billions of rows) bigquery table into a large number (can be around 100k) smaller tables based on the value of a particular column (not date). I can't figure out how to do it efficiently in bigquery itself, so I am thinking of using dataflow.

With dataflow, we can first load the data from , then create a key value pair for each record, the key is all the possible values for the particular column we want to split the table, then we can group the records by the key. so after this operation, we have PCollection of the (key, [records]). we would then need to write PCollection back to bigquery table, the table name can be key_table.

So the operation would be: p | beam.io.Read(beam.io.BigQuerySource()) | beam.map(lambda record : (record['splitcol'], record)) | beam.GroupByKey() | beam.io.Write(beam.io.BigQuerySink)

The key question now is how do I write to different tables in the last step based on the value in each element in PCollection.

This question is somehow related to the another question: Writing different values to different BigQuery tables in Apache Beam. But I am a python guy, not sure if the same solution is possible in Python SDK also.

1

1 Answers

0
votes

Currently this feature (value-dependent BigQueryIO.write()) is only supported in Beam Java. Unfortunately I can't think of an easy way to mimic it using Beam Python, short of reimplementing the respective Java code. Please feel free to open a JIRA feature request.

I guess the simplest thing that comes to mind is writing a DoFn to manually write your rows to the respective tables, using the BigQuery streaming insert API (rather than the Beam BigQuery connector), however keep in mind that streaming inserts are more expensive and subject to more strict quota policies than bulk imports (which are used by the Java BigQuery connector when writing a bounded PCollection).

There is also work happening in Beam on allowing reuse of transforms across languages - a design is being discussed at https://s.apache.org/beam-mixed-language-pipelines. When that work is completed, you would be able to use the Java BigQuery connector from a Python pipeline.