4
votes

As far as I know, streaming data into BigQuery can cause duplicate rows, as mentioned here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#real-time_dashboards_and_queries
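For context, BigQuery's streaming inserts offer best-effort de-duplication when each row carries an `insertId`. A common approach is deriving a deterministic id from the row's content, so a retried insert reuses the same id and the duplicate can be dropped. The sketch below is plain Python illustrating that idea; the row shape and helper name are my own, not part of any Google API:

```python
import hashlib
import json

# Sketch: derive a stable insertId from row content, so a retried send of
# the same row produces the same id and is de-duplicable on the BigQuery
# side. The field names here are invented for illustration.

def insert_id_for(row):
    """Derive a deterministic insertId from the row content."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

row = {"user": "alice", "event": "login", "ts": "2016-01-01T00:00:00Z"}
retry = dict(row)  # a retried send of the same row

# Same content -> same insertId, so the retry can be recognized as a duplicate.
print(insert_id_for(row) == insert_id_for(retry))  # True
```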

On the other hand, will publishing data to PubSub and then using Dataflow to insert it into BigQuery prevent duplicate rows? There is also a tutorial for real-time data analysis here: https://cloud.google.com/solutions/real-time/fluentd-bigquery

So what are the other pros and cons, and in which cases should I use Dataflow to stream data from PubSub?


1 Answer

5
votes

With Google Dataflow and PubSub you have full control over your streaming data: you can slice and dice the data in real time, implement your own business logic, and finally write the results to a BigQuery table. On the other hand, if you stream data directly into BigQuery, you lose that control over your data.

The pros and cons really depend on what you need to do with your streaming data. If you are doing flat insertion, there is no need for Dataflow; but if you need serious computation over your stream, such as group-by-key, merge, partition, or sum, then Dataflow is probably the best approach. One thing to keep in mind is cost: once you start pushing serious volumes of data into PubSub and using Dataflow to manipulate them, it starts getting costly.
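For example, a sum-per-key aggregation (the kind of computation that justifies Dataflow) boils down to something like this. The sketch is plain Python standing in for a Beam `GroupByKey` followed by a sum; the event shape and names are invented for illustration:

```python
from collections import defaultdict

# Plain-Python stand-in for a Dataflow/Beam "sum per key" aggregation.
# In a real pipeline this would be a GroupByKey transform followed by a
# per-key sum; the (key, value) event shape here is illustrative only.

def sum_per_key(events):
    """Aggregate (key, value) events into per-key totals."""
    totals = defaultdict(int)
    for key, value in events:
        totals[key] += value
    return dict(totals)

events = [("user_a", 3), ("user_b", 1), ("user_a", 4)]
print(sum_per_key(events))  # {'user_a': 7, 'user_b': 1}
```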

To answer your question: yes, you can eliminate duplicate rows using Dataflow. Since Dataflow has full control of the data, you can use pipeline filters to check for any condition that identifies duplicate values. My current use of a Dataflow pipeline is manipulating customer log records in real time, with serious pre-aggregation done in Dataflow and the stream of logs passed through PubSub. Dataflow is very powerful for both batch and streaming data manipulation. Hope this helps.
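The "pipeline filter" idea above can be sketched as a stateful stage that drops any record whose key has already been seen before the write to BigQuery. Again, this is plain Python illustrating the technique, not the actual Dataflow/Beam API; the stage and field names are made up:

```python
# Minimal pipeline sketch: a de-duplicating filter stage in front of the
# BigQuery write. Record fields and function names are invented examples.

def drop_duplicates(records, key_field="event_id"):
    """Stateful filter stage: let each key through only once."""
    seen = set()
    for record in records:
        key = record[key_field]
        if key in seen:
            continue  # duplicate delivery, filter it out
        seen.add(key)
        yield record

def run_pipeline(raw_records):
    """Compose the filter stage; the real write step is omitted here."""
    return list(drop_duplicates(raw_records))

logs = [
    {"event_id": "e1", "msg": "login"},
    {"event_id": "e2", "msg": "click"},
    {"event_id": "e1", "msg": "login"},  # duplicate delivery from PubSub
]
print(run_pipeline(logs))
```

The same filter shape maps onto a Beam `ParDo` or `Filter` transform in a real streaming pipeline.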