Basically, we want to split one big BigQuery table (billions of rows) into a large number of smaller tables (possibly around 100k) based on the value of a particular column (not a date column). I can't figure out how to do this efficiently in BigQuery itself, so I am thinking of using Dataflow.
With Dataflow, we can first load the data from BigQuery, then create a key-value pair for each record, where the key is the record's value in the column we want to split on. Then we can group the records by key, so after this operation we have a PCollection of (key, [records]). We would then need to write that PCollection back to BigQuery, one table per key; the table name can be key_table.
So the pipeline would be: p | beam.io.Read(beam.io.BigQuerySource()) | beam.Map(lambda record: (record['splitcol'], record)) | beam.GroupByKey() | beam.io.Write(beam.io.BigQuerySink())
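To make the intended result concrete, here is a minimal sketch in plain Python (not Beam) of the grouping I have in mind. The helper names `table_name_for` and `group_by_split_column` and the sample rows are made up for illustration, and the `<key>_table` naming is just the scheme described above; this only shows the shape of the output, not how to write it to BigQuery.

```python
from collections import defaultdict


def table_name_for(key):
    # Hypothetical naming scheme: one destination table per key, "<key>_table".
    return f"{key}_table"


def group_by_split_column(records, split_col="splitcol"):
    # Mimics Map(lambda r: (r[split_col], r)) followed by GroupByKey():
    # collect every record under the value of its split column.
    grouped = defaultdict(list)
    for record in records:
        grouped[record[split_col]].append(record)
    # Map each key to the table its records should end up in.
    return {table_name_for(key): rows for key, rows in grouped.items()}


# Made-up sample rows, standing in for the rows read from the big table.
rows = [
    {"splitcol": "a", "value": 1},
    {"splitcol": "b", "value": 2},
    {"splitcol": "a", "value": 3},
]
tables = group_by_split_column(rows)
```

Here `tables` maps each destination table name to the list of records that should be written to it, which is what I would like the last pipeline step to do at scale.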
The key question now is: in the last step, how do I write to different tables based on the key of each element in the PCollection?
This question is somewhat related to another question: Writing different values to different BigQuery tables in Apache Beam. But I work in Python, and I'm not sure whether the same solution is possible in the Python SDK.