What I would like to do in the pipeline:
- Read from pub/sub (done)
- Transform this data to dictionary (done)
- Take the value of a specified key from the dict (done)
- Run a parameterized/dynamic query in BigQuery whose WHERE clause looks like this:
  SELECT field1 FROM Table WHERE field2 = @valueFromP/S
The pipeline:
    | 'Read from PubSub' >> beam.io.ReadFromPubSub(subscription='')
    | 'String to dictionary' >> beam.Map(lambda s: data_ingestion.parse_method(s))
    | 'BigQuery' >> <Here is where I'm not sure how to do it>
The normal way to read from BQ would be:
    | 'Read' >> beam.io.Read(beam.io.BigQuerySource(
        query="SELECT field1 FROM table where field2='string'", use_standard_sql=True))
I have read about parameterized queries, but I'm not sure whether they work with Apache Beam.
Could it be done using side inputs?
What would be the best way to do this?
What I've tried:
    def parse_methodBQ(input):
        # Build the query string from the parsed Pub/Sub message.
        query = "SELECT field1 FROM table WHERE field1='%s' AND field2=True" % (input['field1'])
        return query

    class ReadFromBigQuery(beam.PTransform):
        def expand(self, pcoll):
            return (
                pcoll
                | 'FormatQuery' >> beam.Map(parse_methodBQ)
                | 'Read' >> beam.Map(lambda s: beam.io.Read(beam.io.BigQuerySource(query=s)))
            )

    with beam.Pipeline(options=pipeline_options) as p:
        transform = (p | 'BQ' >> ReadFromBigQuery())
The result (why this?):
<Read(PTransform) label=[Read]>
The expected result would be something like:
{u'Field1': u'string', u'Field2': Bool}
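On the "why this?" part: beam.Map emits whatever the mapped function returns for each element, and in the attempt the lambda returns a Read transform object instead of rows, so that object's repr is what shows up. A minimal plain-Python analogy follows; FakeRead and make_read are hypothetical stand-ins, not Beam classes, and the list comprehension only imitates what a Map step does.

```python
class FakeRead:
    """Hypothetical stand-in for a Beam read transform.

    Like a real transform, it only describes work; constructing it
    does not run any query.
    """

    def __init__(self, query):
        self.query = query

    def __repr__(self):
        return "<FakeRead label=[Read]>"


def make_read(query):
    # Mirrors the lambda passed to beam.Map in the attempt.
    return FakeRead(query)


queries = ["SELECT field1 FROM table WHERE field2=True"]
# Map-style processing emits each function's return value unchanged,
# so the outputs are transform objects, not query results.
outputs = [make_read(q) for q in queries]
print(outputs[0])
```

The query has to be executed inside the mapped function itself for rows to come out of the step.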
THE SOLUTION
In the pipeline:
    | 'BQ' >> beam.Map(parse_method_BQ))
The function (using the BigQuery 0.25 API for Dataflow):
    import time
    import uuid

    from google.cloud import bigquery

    def parse_method_BQ(input):
        client = bigquery.Client()
        QUERY = "SELECT field1 FROM table WHERE field1='%s' AND field2=True" % (input['field1'])
        query_job = client.run_async_query(query=QUERY, job_name='temp-query-job_{}'.format(uuid.uuid4()))  # API request
        # In the 0.25 API the SQL dialect is set on the job, not on the client.
        query_job.use_legacy_sql = False
        query_job.begin()
        while True:
            query_job.reload()  # Refreshes the state via a GET request.
            if query_job.state == 'DONE':
                if query_job.error_result:
                    raise RuntimeError(query_job.errors)
                rows = query_job.results().fetch_data()
                for row in rows:
                    if row[0] is not None:
                        return input  # Pass the element through when a match exists.
                return None  # Job finished without a match; stop polling.
            time.sleep(1)
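Since the original goal was a parameterized query, here is a rough sketch of the same lookup with the value passed as a named parameter (@field1) instead of being formatted into the SQL string. It assumes a newer google-cloud-bigquery client (1.0 or later, not the 0.25 API used in this solution), where client.query, QueryJobConfig and ScalarQueryParameter are available. The names build_lookup and lookup_in_bigquery are made up for illustration, and the table and column names are the placeholders from the question.

```python
def build_lookup(element):
    """Return the SQL text and parameter specs for one parsed message.

    The value travels as a named parameter (@field1) rather than
    being interpolated into the SQL string.
    """
    sql = "SELECT field1 FROM table WHERE field1 = @field1 AND field2 = TRUE"
    params = [("field1", "STRING", element["field1"])]
    return sql, params


def lookup_in_bigquery(element, client=None):
    """Return the element if the query yields a non-null field1, else None."""
    # Imported inside the function so it is resolved on the worker;
    # assumes google-cloud-bigquery >= 1.0 is installed there.
    from google.cloud import bigquery

    client = client or bigquery.Client()
    sql, params = build_lookup(element)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter(*p) for p in params]
    )
    # result() blocks until the job finishes, so no manual polling loop.
    rows = client.query(sql, job_config=job_config).result()
    for row in rows:
        if row[0] is not None:
            return element
    return None
```

Under those assumptions, lookup_in_bigquery could stand in for parse_method_BQ in the beam.Map step. Creating a client per element is simple but adds overhead on a busy subscription.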