I am experiencing unexpected performance issues when writing to BigQuery with streaming inserts and the Python SDK 2.23.
Without the write step, the pipeline runs on one worker at ~20-30% CPU. With the BigQuery step added, it scales up to 6 workers, all at 70-90% CPU.
I am fairly new to Dataflow and Beam, so this behaviour may be normal or I may be doing something wrong, but using 6 machines to write 250 rows per second to BigQuery seems heavy to me. I am also wondering how it would even be possible to reach the insertion quota of 100K rows per second at this rate.
My pipeline looks like this:
(
    p
    | "Read from PubSub" >> beam.io.ReadFromPubSub(subscription=options.pubsub_subscription)  # ~40/s
    | "Split messages" >> beam.FlatMap(split_messages)  # ~400/s
    | "Prepare message for BigQuery" >> beam.Map(prepare_row)
    | "Filter known message types" >> beam.Filter(filter_message_types)  # ~250/s
    | "Write to BigQuery" >> beam.io.WriteToBigQuery(
        table=options.table_spec_position,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        additional_bq_parameters=additional_bq_parameters,
    )
)
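For reference, the helpers are simple element-wise Python functions. A simplified sketch of their shape (the field names and the set of known message types are placeholders here, not the real ones):

import json

KNOWN_MESSAGE_TYPES = {"position"}  # placeholder; the real set is larger

def split_messages(message):
    # One Pub/Sub message contains several records (hence ~40/s -> ~400/s).
    for record in json.loads(message):
        yield record

def prepare_row(record):
    # Build a dict matching the BigQuery table schema.
    return {"message_type": record.get("type"), "payload": json.dumps(record)}

def filter_message_types(row):
    # Keep only rows with a known message type (~400/s -> ~250/s).
    return row["message_type"] in KNOWN_MESSAGE_TYPES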
The pipeline runs with the following options, although I experienced similar behaviour without using Streaming Engine.
--enable_streaming_engine \
--autoscaling_algorithm=THROUGHPUT_BASED \
--max_num_workers=15 \
--machine_type=n1-standard-2 \
--disk_size_gb=30 \
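For completeness, this is roughly how the options are built in Python. The class name, project and region values are placeholders; the custom arguments correspond to the options.pubsub_subscription and options.table_spec_position references above (simplified sketch):

from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

class MyPipelineOptions(PipelineOptions):
    # Custom flags used as options.pubsub_subscription / options.table_spec_position above.
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--pubsub_subscription")
        parser.add_argument("--table_spec_position")

# The same Dataflow flags can also be set programmatically.
options = MyPipelineOptions(
    runner="DataflowRunner",
    project="my-project",      # placeholder
    region="europe-west1",     # placeholder
    enable_streaming_engine=True,
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=15,
    machine_type="n1-standard-2",
    disk_size_gb=30,
)
options.view_as(StandardOptions).streaming = True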
My question is whether this behaviour is normal, or whether there is anything I can do to reduce the number of workers required for this pipeline. Thanks!
Update: Here is an image of the last step of the Dataflow graph with wall times, taken after the job had been running for 1 hour. All the steps before it have very low wall times of only a few seconds.