4 votes

I'm using Dataflow to process files stored in GCS and write to BigQuery tables. Below are my requirements:

  1. Input files contain event records; each record pertains to one eventType.
  2. Records need to be partitioned by eventType.
  3. For each eventType, records are written to a corresponding BigQuery table (one table per eventType).
  4. The event types present in each batch of input files vary.

I'm thinking of applying transforms such as GroupByKey and Partition; however, it seems that I have to know the number (and types) of events at development time, which is needed to determine the partitions.
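
Roughly what I have in mind is sketched below (using the Dataflow Java SDK; the event type list, table names, and schema are placeholders). The hard-coded list of event types is exactly the part I'd like to avoid:

```java
import java.util.Arrays;
import java.util.List;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.transforms.Partition;
import com.google.cloud.dataflow.sdk.transforms.Partition.PartitionFn;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

public class PartitionByEventType {

  // The part I want to avoid: event types fixed at pipeline-construction time.
  private static final List<String> EVENT_TYPES = Arrays.asList("click", "view", "purchase");

  static void partitionAndWrite(PCollection<TableRow> records) {
    // Placeholder schema for the per-type tables.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("eventType").setType("STRING"),
        new TableFieldSchema().setName("payload").setType("STRING")));

    // Partition requires the number of partitions up front.
    PCollectionList<TableRow> byType = records.apply(
        Partition.of(EVENT_TYPES.size(), new PartitionFn<TableRow>() {
          @Override
          public int partitionFor(TableRow row, int numPartitions) {
            // An event type not in the list would return -1 and fail at run time.
            return EVENT_TYPES.indexOf((String) row.get("eventType"));
          }
        }));

    // One BigQueryIO sink per known event type.
    for (int i = 0; i < EVENT_TYPES.size(); i++) {
      byType.get(i).apply(
          BigQueryIO.Write.to("my-project:my_dataset.events_" + EVENT_TYPES.get(i))
              .withSchema(schema));
    }
  }
}
```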

Does anyone have a good idea for doing the partitioning dynamically, i.e., so that the partitions can be determined at run time?

Hi! Providing more flexibility for customized I/O is a feature that is currently being worked on. This is a use case that will be kept in mind as this work progresses. – MattL
Thanks Matt! When do you think the feature will be ready? – Echo
We cannot comment on the specific timeline at this point, but this is something that we are actively working on. – Davor Bonaci
The API for defining custom output formats has landed in GitHub; see github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/… and examples such as github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/… and github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/… . Does this help your use case? – jkff
@Echo, sorry about the resurrection, but have you found a good solution for this? Specifically the part about handling dynamic events. We're doing something similar, and the simplest and most cost-efficient approach we have come up with so far is to download the file to a Compute Engine instance, partition it locally via a script (Python), upload the partitioned files back to GCS, and then invoke bq commands per file to import into the relevant "event" table. – DannyA

1 Answer

1 vote

Why not load everything into a single "raw" BigQuery table, then use the BigQuery API to determine the distinct event types and export each event type to its own table (e.g., via https://cloud.google.com/bigquery/bq-command-line-tool#createtablequery or an API call)?
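
For the API route, here is a rough sketch (using the google-cloud-bigquery Java client; dataset and table names are placeholders, and event type values are assumed to be safe to use in table names): first find the distinct event types in the raw table, then run one query per type with a destination table.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableResult;

public class SplitRawTableByEventType {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // 1. Discover the event types that actually showed up in this batch.
    TableResult types = bigquery.query(QueryJobConfiguration.of(
        "SELECT DISTINCT eventType FROM my_dataset.raw_events"));

    // 2. Write the matching rows of each type to its own destination table.
    for (FieldValueList row : types.iterateAll()) {
      String eventType = row.get("eventType").getStringValue();
      QueryJobConfiguration perType = QueryJobConfiguration.newBuilder(
              "SELECT * FROM my_dataset.raw_events WHERE eventType = '" + eventType + "'")
          .setDestinationTable(TableId.of("my_dataset", "events_" + eventType))
          .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
          .build();
      bigquery.query(perType);
    }
  }
}
```

The same thing can be done from the command line by running `bq query` with `--destination_table` once per event type, as described in the link above.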

If your input format is simple, you can do this without using Dataflow at all, and it will probably be more cost-efficient.