
After a few days of testing and developing some Google Dataflow templates with Python, I'm questioning whether BigQuery is appropriate for our project.


Why these doubts?


  • BigQuery focuses on Online Analytical Processing (OLAP), so treating it like an Online Transaction Processing (OLTP) database might not be the best option.

    Even knowing that, I decided to use BQ because it has better compatibility with Google Dataflow.


  • After developing a first version of a Python Dataflow template and trying it with a small amount of data, it works well. Using streaming or batch execution of this pipeline, I don't run into any special issues.

    The problems begin when I try to use BQ as an OLTP database. Here is where I found some issues due to my use case: in some cases I cannot use bulk updates and have to apply single-row mutations, so I have to use DML statements to perform this task (a minimal sketch of what that looks like follows this list). The result is that the execution time starts to increase considerably with hundreds of single-row mutations.


  • Even if I accept this increase in time, I found that in BQ you can only run 1,000 UPDATE statements per table per day. (More quotas)
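
Roughly, each of those mutations ends up as its own DML statement. Here is a minimal sketch of what that looks like from Python with the google-cloud-bigquery client; the project, dataset, table and column names are placeholders, not from my actual pipeline:

```python
from google.cloud import bigquery

client = bigquery.Client()

def update_single_row(user_id, new_status):
    # Each call issues one UPDATE statement; with hundreds of rows this gets
    # slow and quickly runs into the per-table daily DML quota.
    query = """
        UPDATE `my_project.my_dataset.users`
        SET status = @new_status
        WHERE user_id = @user_id
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("new_status", "STRING", new_status),
            bigquery.ScalarQueryParameter("user_id", "STRING", user_id),
        ]
    )
    client.query(query, job_config=job_config).result()
```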


Alternatives

Here is where I'm not sure which other services I should use instead of BigQuery:

  • Cloud SQL is Google's OLTP database. It could be appropriate for my use case, but does Cloud SQL work as fast as BQ when I try to analyze the data? I will have many tables with millions of rows and, for example, working with Microsoft SQL Server is not sufficient for me due to the high processing times.

  • How does Cloud SQL work with Apache Beam and pipelines?

  • Should I work with Cloud SQL and migrate the data to BigQuery each day, to avoid making single-row mutations in BQ?

  • Which service could be the best to use in this case?


2 Answers


How does Cloud SQL work with Apache Beam and pipelines?

The output of the final transform will be a PCollection, not a string. Writing that collection to Cloud SQL row by row can overload the database and adds high latency. You could write using MySQLdb.connect inside a ParDo function where you process a single element (TableRow), although that is not good practice.
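
A rough sketch of that per-element write, assuming a MySQL-flavoured Cloud SQL instance and the MySQLdb (mysqlclient) driver; the host, credentials and table are placeholders:

```python
import apache_beam as beam
import MySQLdb

class WriteRowToCloudSQL(beam.DoFn):
    def process(self, row):
        # Opening a connection per element is exactly what makes this slow and
        # easy to overload; it only mirrors the workaround described above.
        conn = MySQLdb.connect(
            host="CLOUD_SQL_IP", user="user", passwd="password", db="mydb"
        )
        try:
            cursor = conn.cursor()
            cursor.execute(
                "INSERT INTO my_table (user_id, status) VALUES (%s, %s)",
                (row["user_id"], row["status"]),
            )
            conn.commit()
        finally:
            conn.close()

# Usage in a pipeline: rows | beam.ParDo(WriteRowToCloudSQL())
```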

Should I work with Cloud SQL and migrate the data to BigQuery each day, to avoid making single-row mutations in BQ?

You can write a Python script which exports data from the Cloud SQL table to either GCS or local storage (wherever your script is scheduled to run) and then loads it back into BQ.

But you have to be careful with NULL values. When you export data as a file, NULL values in the table will be replaced with \N, which leads to an incorrect column count while uploading to BQ. All of this can be done with a single Python script.
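
One possible shape for that script, assuming MySQLdb for the export and google-cloud-bigquery for the load; the connection details, file path and table names are placeholders:

```python
import csv

import MySQLdb
from google.cloud import bigquery

def export_and_load():
    # Export the Cloud SQL table to a local CSV file.
    conn = MySQLdb.connect(host="CLOUD_SQL_IP", user="user",
                           passwd="password", db="mydb")
    cursor = conn.cursor()
    cursor.execute("SELECT user_id, status, updated_at FROM my_table")

    local_file = "/tmp/my_table.csv"
    with open(local_file, "w", newline="") as f:
        writer = csv.writer(f)
        for row in cursor.fetchall():
            # Write NULLs as empty strings so no \N markers end up in the file.
            writer.writerow(["" if col is None else col for col in row])
    conn.close()

    # Load the CSV into BigQuery, replacing yesterday's copy of the table.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,
    )
    with open(local_file, "rb") as f:
        load_job = client.load_table_from_file(
            f, "my_project.my_dataset.my_table", job_config=job_config
        )
    load_job.result()  # wait for the load job to finish
```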

Which service could be the best to use in this case?


I would avoid using BigQuery for transactional workloads, as it tends to be very inefficient for update operations. Similarly, Cloud SQL will be good for transactional workloads but slow for analytical ones.

If you need a database that can handle both updates and an analytical workload, I'd suggest looking into Bigtable. You can read about how to pick a storage option for GCP here.

Another option, as you said, would be doing all your transactional work in Cloud SQL and, once a day or so, syncing it to BigQuery and doing the analytical work there; WePay, for instance, uses a workflow like that.

To address your other question: yes, Dataflow can work with Cloud SQL without any issue, just connecting normally. Just make sure to use connection pooling to avoid running out of connections.
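
A minimal sketch of that pooling idea, using SQLAlchemy inside a Beam DoFn; the connection URL and table are placeholders rather than anything from your pipeline:

```python
import apache_beam as beam
import sqlalchemy

class PooledCloudSQLWrite(beam.DoFn):
    def setup(self):
        # One engine (and therefore one connection pool) per worker, reused
        # across bundles instead of opening a new connection per element.
        self.engine = sqlalchemy.create_engine(
            "mysql+mysqldb://user:password@CLOUD_SQL_IP/mydb",
            pool_size=5,
            max_overflow=2,
        )

    def process(self, row):
        with self.engine.begin() as conn:
            conn.execute(
                sqlalchemy.text(
                    "INSERT INTO my_table (user_id, status) VALUES (:uid, :st)"
                ),
                {"uid": row["user_id"], "st": row["status"]},
            )

    def teardown(self):
        self.engine.dispose()
```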