2
votes

I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.: https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries


But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options there are for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it as a one-time copy, not streaming: https://cloud.google.com/sql/docs/mysql/replication/replication-from-external

The reason why I want this streaming into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if the data source is a Google BigQuery database. E.g. if we have a table with 1M entries, and we want the user to supply a Google Data Studio parameter, the query becomes:

SELECT * from table WHERE id=@parameter;

which means that the query will be faster and won't hit the 100K-record limit in Google Data Studio.

What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?

If you are looking for streaming data into BigQuery, I don't think that using EXTERNAL_QUERY() is the way to go. Instead of planning to move data from PostgreSQL to Cloud SQL, why don't you stream the data directly into BigQuery? It is not straightforward, but with a proper Dataflow job you can do it. To make it simpler, you could batch every X hours. - Alvaro
What would be the limitation of EXTERNAL_QUERY? Performance? - 719016
Performance will be the most important one, yes. But there are more - cloud.google.com/bigquery/docs/…. Do you need your data in real time, or is that not important? Can you consider building a pipeline to extract the data and load it into BQ? - Alvaro

3 Answers

4
votes

Perhaps you missed the options stated in the Google Cloud user guide?

https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication

Notice that this section says:

"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."

I suspect online mode is what you are looking for.

0
votes

What you are looking for will require some architecture design based on your needs, and some coding. There isn't a feature that automatically syncs your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality, which has some limitations: one connection per database, performance, the total number of connections, etc.).

If you are not looking for the data in real time, you can use Airflow, for instance: have a DAG connect to all your DBs once per day (using KubernetesPodOperator, for instance), extract the previous day's data, and load it into BQ. A typical ETL process, but in this case more EL(T). You can run this process more often if you cannot wait a full day for the previous day's data.
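As a sketch of that daily EL step, the extraction window can be derived from the run date. This is a minimal example; the `orders` table and the `updated_at` timestamp column are assumptions about your schema, not from the question:

```python
from datetime import date, timedelta

def build_extract_query(table: str, run_date: date) -> str:
    """Build the SQL that extracts the previous day's rows for a daily
    EL run. Assumes the source table has an `updated_at` timestamp
    column; adjust the column name to match your schema."""
    start = run_date - timedelta(days=1)
    return (
        f"SELECT * FROM {table} "
        f"WHERE updated_at >= '{start.isoformat()}' "
        f"AND updated_at < '{run_date.isoformat()}'"
    )

# Inside the Airflow task you would run this query against PostgreSQL
# (e.g. with psycopg2), write the rows to a file or GCS, and load them
# into BigQuery with a load job -- no streaming quota involved.
print(build_extract_query("orders", date(2020, 1, 2)))
```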

On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector.
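Roughly, such a Dataflow job could read over JDBC with Beam's cross-language JDBC source. Below is a sketch under stated assumptions: the host, table, project, and dataset names are placeholders, and the Beam pipeline itself is shown only as commented-out code (it needs `apache-beam[gcp]` and a Dataflow environment to run):

```python
def jdbc_url(host: str, port: int, database: str) -> str:
    """Build the JDBC connection string that Beam's JDBC source expects
    for PostgreSQL. Host and database names here are placeholders."""
    return f"jdbc:postgresql://{host}:{port}/{database}"

# A Dataflow job reading over JDBC might then look roughly like this:
#
#   import apache_beam as beam
#   from apache_beam.io.jdbc import ReadFromJdbc
#
#   with beam.Pipeline(options=pipeline_options) as p:
#       rows = p | ReadFromJdbc(
#           table_name="orders",
#           driver_class_name="org.postgresql.Driver",
#           jdbc_url=jdbc_url("db.example.com", 5432, "mydb"),
#           username="reader",
#           password="...",
#       )
#       rows | beam.io.WriteToBigQuery("my_project:my_dataset.orders")
print(jdbc_url("db.example.com", 5432, "mydb"))
```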

In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) to stream your data into BigQuery at the same moment you write to your PostgreSQL DB.
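A minimal sketch of that dual-write idea, assuming a hypothetical `orders` table: the helper below only makes a record JSON-serializable, and the actual streaming call (the google-cloud-bigquery client's `insert_rows_json`) is shown commented out, since it needs credentials and an existing table:

```python
import datetime

def to_bq_row(record: dict) -> dict:
    """Make an application record JSON-serializable for BigQuery's
    streaming insert API: dates and timestamps become ISO 8601 strings,
    which BigQuery accepts for DATE and TIMESTAMP columns."""
    return {
        key: value.isoformat()
        if isinstance(value, (datetime.date, datetime.datetime))
        else value
        for key, value in record.items()
    }

# After the usual INSERT into PostgreSQL, stream the same row to
# BigQuery (the table name is a placeholder):
#
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   errors = client.insert_rows_json(
#       "my_project.my_dataset.orders",
#       [to_bq_row({"id": 1, "created_at": datetime.datetime.utcnow()})],
#   )
#   if errors:
#       ...  # streaming inserts can fail partially; log and retry
```

The trade-off mentioned above is real: every code path that writes to PostgreSQL now also has to write to BigQuery, which is why it is harder to maintain than a separate pipeline.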

0
votes

Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:

SELECT * from table WHERE id=$filter_value;

This should achieve the same results you want without going through BigQuery.