3
votes

I need some suggestions for one of the use cases I am working on.

Use Case:

We have data in Cloud SQL around 5-10 tables, some are treated as lookup and others transactional. We need to get this to BigQuery in a way to make 3-4 tables(Flattened, Nested or Denormalized) out of these which will be used for reporting in Data Studio, Looker, etc.

Data should be processed incrementally and changes in Cloud SQL could happen every 5 min, which means that data should be available to BigQuery reporting in near real-time. If possible we should also maintain the history of Data change.

Solution:

We kept 3 layers in BigQuery, so data from Cloud SQL comes to the first layer, then after flattening we save it to the second layer (which maintains history), and then processes it to insert in the third layer(Reporting). The problem with this approach is that it was developed for daily processing and can not help in processing every 5 min change of data in Cloud SQL.

Any suggestion is welcome.

Note:- We would require to use Dataflow, VM or some other way to do some logic processing, means we can not use only scheduled query as we can not use query to process all logic, but yes mix of view and code can be used.

2

2 Answers

6
votes

This is more straightforward than you might think: BigQuery can read data straight out of your Cloud SQL.

Docs:

More reading:

So all you need to do is write a MySQL query that finds all the changes from the last 5 minutes. You can run this query from within BigQuery - and materialize all the data into BigQuery with one query like:

INSERT INTO `dataset.table`

SELECT * 
FROM EXTERNAL_QUERY(
  'your.CloudSql.database'
  , 'SELECT * FROM users')
0
votes

That's an interesting use case! If BQ external queries are not your jam, we recently built a solution to 'sync' tables from MySQL with tables from BigQuery.

The solution works by using two components:

  1. A connector developed with Debezium to publish changes from MySQL to PubSub
  2. A Dataflow pipeline that consumes those changes from PubSub, and pushes them to BigQuery

Refer to this README to figure out how to deploy it: https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent/

Let me know if it's helpful!