I'm using Dataflow to process files stored in GCS and write the results to BigQuery tables. Here are my requirements:
- the input files contain event records, and each record pertains to exactly one eventType;
- records need to be partitioned by eventType;
- for each eventType, the records should be written to a corresponding BigQuery table, i.e. one table per eventType;
- the set of event types varies from one batch of input files to the next (see the illustration below).
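To illustrate what I mean (the eventType field and table names here are made up), two batches might look like this:

```
batch 1 records:                    target tables:
{"eventType": "click", ...}    ->   my_dataset.events_click
{"eventType": "view", ...}     ->   my_dataset.events_view

batch 2 records:
{"eventType": "view", ...}     ->   my_dataset.events_view
{"eventType": "purchase", ...} ->   my_dataset.events_purchase   (new type, not in batch 1)
```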
I'm thinking of applying transforms such as "GroupByKey" and "Partition"; however, it seems that I would have to know the number (and types) of events at development time in order to determine the partitions, roughly as in the sketch below.
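This is approximately what I have in mind (a minimal sketch using the Beam Java SDK; the eventType field, table names, and the hard-coded type list are just for illustration). The problem is that the list of event types has to be baked in when the pipeline graph is constructed:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

import com.google.api.services.bigquery.model.TableRow;

public class StaticPartitionSketch {

  // The event types must be enumerated at development time --
  // exactly the limitation I'd like to avoid.
  private static final List<String> EVENT_TYPES =
      Arrays.asList("click", "view", "purchase");

  static void writePerType(PCollection<TableRow> events) {
    // Partition needs the number of partitions when the pipeline is built.
    // A record whose eventType is not in the list would make partitionFor
    // return -1 and fail the pipeline at run time.
    PCollectionList<TableRow> byType = events.apply(
        Partition.of(EVENT_TYPES.size(),
            (TableRow row, int numPartitions) ->
                EVENT_TYPES.indexOf((String) row.get("eventType"))));

    // One BigQuery sink per known event type (tables assumed to exist);
    // handling a new type means editing and redeploying the pipeline.
    for (int i = 0; i < EVENT_TYPES.size(); i++) {
      byType.get(i).apply("Write_" + EVENT_TYPES.get(i),
          BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.events_" + EVENT_TYPES.get(i))
              .withCreateDisposition(CreateDisposition.CREATE_NEVER));
    }
  }
}
```

With this approach, a new event type showing up in tomorrow's batch forces a code change, which is what I want to avoid.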
Does anyone have a good idea for doing the partitioning dynamically, i.e. so that the partitions can be determined at run time?