3 votes

I have a pipeline that reads streaming data from Cloud Pub/Sub. The data is processed by Dataflow and then saved into one large BigQuery table; each Pub/Sub message includes an associated account_id. Is there a way to create new tables on the fly when a new account_id is identified, and then populate them with data for that account_id?

I know this can be done by updating the pipeline for each new account, but in an ideal world Cloud Dataflow would generate these tables programmatically from within the code.
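Simplified, the write step today is roughly of this shape (a sketch only, not the actual code; the Beam Java SDK is assumed, and the project, dataset, table and schema names below are placeholders):

```java
import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

public class WriteStep {
  // Placeholder schema: account_id plus the raw message payload.
  private static final TableSchema SCHEMA = new TableSchema().setFields(Arrays.asList(
      new TableFieldSchema().setName("account_id").setType("STRING"),
      new TableFieldSchema().setName("payload").setType("STRING")));

  // Current behaviour: rows for every account land in one large table.
  static void writeAllAccounts(PCollection<TableRow> rows) {
    rows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.all_accounts")  // single fixed destination
            .withSchema(SCHEMA)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}
```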

Do I understand correctly that you don't know the names of these BigQuery tables/sinks in your pipeline until runtime, i.e. they are dynamic. Is that right? – Graham Polley
Correct, I want to create the tables at runtime in a dynamic manner. I do know their names, but we get new accounts all the time and it is not time efficient to constantly update the pipeline for each account. – Owen Monagan
And the issue with using one large table for all accounts is the cost and speed of iterating through every account's row data. – Owen Monagan
Please also see this answer: stackoverflow.com/questions/43505534/… – jkff

2 Answers

1 vote

I wanted to share a few options I see:

Option 1 – wait for the partitioning-on-a-non-date-field feature.
It is not known when this will be implemented and made available, so it might not be what you want right now. But when it goes live, it will be the best option for such scenarios.

Option 2 – hash your account_id into a predefined number of buckets. In this case you can pre-create all of those tables, and have logic in your pipeline code that routes each row to the respective destination table based on the account hash (see the sketch below). The same hashing logic then needs to be used in the queries that read that data.
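A minimal sketch of that routing, assuming the Beam Java SDK; the project/dataset names, bucket count and the account_id field are placeholders, and the bucket tables are assumed to be pre-created with a matching schema:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;

public class BucketedWrite {
  // Fixed up front; all bucket tables are pre-created with the same schema.
  private static final int NUM_BUCKETS = 16;

  // Map an account_id to one of the pre-created bucket tables.
  // String.hashCode() is used only to keep the sketch short; in practice pick a
  // hash that queries can reproduce on the BigQuery side as well.
  static String bucketTableSpec(String accountId) {
    int bucket = Math.floorMod(accountId.hashCode(), NUM_BUCKETS);
    return "my-project:accounts_dataset.accounts_bucket_" + bucket;
  }

  static void writeBucketed(PCollection<TableRow> rows) {
    rows.apply("WriteToBucketTable",
        BigQueryIO.writeTableRows()
            .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>)
                row -> {
                  String accountId = (String) row.getValue().get("account_id");
                  return new TableDestination(bucketTableSpec(accountId), null);
                })
            // The tables already exist, so no schema or table creation is needed here.
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}
```

Because the bucket count is fixed up front, the set of destination tables never changes as new accounts arrive; queries apply the same hash to pick the right bucket table.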

0 votes

The API for creating BigQuery Tables is at https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/insert.
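For example, with the google-cloud-bigquery Java client library (which wraps that REST API), creating a per-account table could look roughly like this sketch; the dataset name, table naming scheme and schema below are placeholders:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateAccountTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    String accountId = "acct_123";  // placeholder: a newly seen account_id
    TableId tableId = TableId.of("my_dataset", "events_" + accountId);

    // Placeholder schema mirroring the columns of the existing large table.
    Schema schema = Schema.of(
        Field.of("account_id", LegacySQLTypeName.STRING),
        Field.of("payload", LegacySQLTypeName.STRING));

    TableInfo tableInfo =
        TableInfo.newBuilder(tableId, StandardTableDefinition.of(schema)).build();

    bigquery.create(tableInfo);  // issues the underlying tables.insert request
  }
}
```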

Nevertheless, it would probably be easier to store all accounts in one static table that contains account_id as a column.