My project is running Python 2.7 (yes, I know...) with Apache Beam 2.19 on Google Dataflow. We're connecting to BigQuery in the same way that's specified in the Apache Beam tutorial:

p | 'Get data from BigQuery' >> beam.io.Read(beam.io.BigQuerySource(
    query=get_query(limit),
    use_standard_sql=True))

However, the read step of this pipeline is incredibly slow, most likely due to the reading of Avro files. It doesn't seem like fastavro is actually being used, though. AFAIK, you need to set the use_fastavro flag explicitly when running on Python < 3.7. Is that even possible with this setup? Or will I need to export to GCS manually first?

1 Answer


Which version of Beam are you running? As of Beam 2.6, you can enable use_fastavro as shown here:

Dataflow Python SDK Avro Source/Sink

It looks like in more recent Beam SDKs fastavro defaults to true if you are on Python 3+, but you can still enable it manually: https://github.com/apache/beam/blob/4743e131edadad42555e605be803e26cb37b7ce6/sdks/python/apache_beam/io/avroio.py#L81
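
For illustration, a minimal sketch of enabling it explicitly. Note that use_fastavro is a parameter of beam.io.ReadFromAvro in the linked avroio.py; I don't believe BigQuerySource exposes it directly, and the GCS path below is a made-up placeholder:

import apache_beam as beam

# Sketch: read previously exported Avro files with fastavro explicitly
# enabled. The GCS glob is hypothetical; point it at your own export.
with beam.Pipeline() as p:
    records = (
        p
        | 'Read Avro with fastavro' >> beam.io.ReadFromAvro(
            'gs://my-bucket/export/*.avro',
            use_fastavro=True))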

I don't see any warnings about setting this on earlier Python versions, so you may want to give it a try and see if it works.
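
If it turns out BigQuerySource doesn't accept the flag, the manual export you mentioned would look roughly like this. This is a sketch using the google-cloud-bigquery client, with hypothetical project, dataset, table, and bucket names; since you're running a query rather than reading a whole table, you'd first materialize the query results into a table with a query job:

from google.cloud import bigquery

# Sketch: export a BigQuery table to Avro files on GCS before the
# pipeline runs. All names here are placeholders.
client = bigquery.Client(project='my-project')
table_ref = client.dataset('my_dataset').table('my_table')

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO)

extract_job = client.extract_table(
    table_ref,
    'gs://my-bucket/export/my_table-*.avro',  # shard pattern for large tables
    job_config=job_config)
extract_job.result()  # block until the export job finishes

Then read the exported files with the ReadFromAvro snippet above.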