2
votes

I'm struggling to find documentation on the inner workings of federated queries in BigQuery when it comes to the impact of those queries on a Cloud SQL (MySQL) database.

I've seen here that federated queries are read-only; no problem there.

The question I have is: are federated queries just like a normal query against the database, as if you were using a management tool such as MySQL Workbench, or is there some other mechanism that federated queries use to read the underlying tables without impacting the usual performance of the database? For instance, if a table/row is locked via a LOCK TABLE statement/transaction and a federated query is then fired, will the federated query still run?

My use case is incrementally loading data out of Cloud SQL and into a BigQuery data warehouse. I want to minimise the impact on my OLTP database as much as possible, without the overhead of creating and maintaining a streaming pipeline.
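For reference, the kind of incremental pull I have in mind would look roughly like this (the connection ID, table name, and updated_at watermark column are illustrative placeholders, not my real schema):

```sql
-- Hypothetical incremental federated pull from Cloud SQL into BigQuery.
-- Connection ID, table, and watermark column are made-up examples.
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  '''SELECT id, payload, updated_at
     FROM orders
     WHERE updated_at > "2021-01-01 00:00:00";'''
);
```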

1
I'm very happy with Stitch Data; it does exactly what you are looking for. You can see more details on this page: stitchdata.com/integrations/mysql/google-bigquery – Murta

1 Answer

3
votes

I have not been able to find any details in the public GCP documentation about the inner workings of federated queries, so this mechanism may be confidential. However, I have run the setup that you described in this question, and it seems that BigQuery performs normal queries against the Cloud SQL instance.

I created a Cloud SQL instance with MySQL and set up an external connection in BigQuery. After executing [1], I'm still able to retrieve results from BigQuery as well as from another Cloud Shell window (which uses a different user from the one holding the lock).

However, if I execute [2] in the original Cloud Shell, I'm not able to retrieve any results (the query keeps waiting) from either BigQuery or my other Cloud Shell session until I execute [3] (I left it running for more than 10 minutes).

Having taken a look at (4) to improve my understanding of the difference between read and write locks, I would say that BigQuery behaves in the same way as any other tool that queries the Cloud SQL instance.

Regarding your use case of migrating data from Cloud SQL into BigQuery while minimising the impact on the database: I understand that saving CSV files to Google Cloud Storage and then creating a BigQuery transfer does not suit your business scenario. On the other hand, the new GCP product, Database Migration Service (5), does not support BigQuery as a destination.

Finally, you might want to consider a pipeline that streams data from Cloud SQL to Pub/Sub and then inserts that data into BigQuery, given that, if you migrate your data with federated queries, "BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior" (6).

[1]: LOCK TABLES tablename READ;

[2]: LOCK TABLES tablename WRITE;

[3]: UNLOCK TABLES;
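Putting [1]–[3] together, the experiment ran roughly as follows (tablename stands for my test table; the timeline comments describe what I observed from the other two sessions):

```sql
-- Session A (original Cloud Shell, the user holding the lock):
LOCK TABLES tablename READ;    -- [1] read lock: other readers still succeed
-- ...a BigQuery federated query and a SELECT from a second
--    Cloud Shell session both return rows at this point...

UNLOCK TABLES;
LOCK TABLES tablename WRITE;   -- [2] write lock: all other readers block
-- ...the same two queries now hang, waiting on the lock...

UNLOCK TABLES;                 -- [3] the blocked queries complete immediately
```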