0
votes

We want to move our data warehouse from a MySQL database to either Redshift or BigQuery.

While optimised for OLAP operations, one of the disadvantages of these column-based databases is that they do not enforce unique constraints.

As such, it is possible to end up with duplicate orders/products in your tables. We work in retail and use the standard Kimball facts and dimensions (star schema) database design.
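To make the risk concrete, here is a minimal sketch of a duplicate check, run against an in-memory SQLite table that stands in for a warehouse table with no enforced constraints. The table name `dim_product`, the column `product_id`, and the helper `find_duplicate_keys` are hypothetical names chosen for illustration; the same GROUP BY / HAVING pattern should carry over to Redshift or BigQuery.

```python
import sqlite3


def find_duplicate_keys(conn, table, key_column):
    """Return (key, count) pairs for keys that appear more than once."""
    # Identifiers cannot be bound as parameters, so they are interpolated;
    # only pass trusted table/column names here.
    sql = (
        f"SELECT {key_column}, COUNT(*) AS n "
        f"FROM {table} GROUP BY {key_column} HAVING COUNT(*) > 1 "
        f"ORDER BY {key_column}"
    )
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
# No PRIMARY KEY / UNIQUE here, mimicking a warehouse that does not enforce them.
conn.execute("CREATE TABLE dim_product (product_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?)",
    [(1, "shirt"), (2, "shoes"), (2, "shoes"), (3, "hat")],
)

duplicates = find_duplicate_keys(conn, "dim_product", "product_id")
print(duplicates)  # [(2, 2)]
```

A check like this only detects duplicates after the fact; it does not prevent them from being loaded.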

One potential solution that was brought forward was to build the database in MySQL and to use a third-party replication tool to sync the data to Redshift/BigQuery. This way, we would enforce key constraints in the original MySQL db and use Redshift/BigQuery only for read queries.

However, will enforcing the constraints in MySQL and setting up binlog replication to Redshift/BigQuery keep the data identical to the data in MySQL, and consequently enforce the unique constraints?

2 Answers

1
votes

First of all, you cannot replicate from MySQL to Redshift/BigQuery.

Please understand that BigQuery is an analytical database.

The advised approach is to set up a replica of your MySQL database in Cloud SQL. In BigQuery you can then run EXTERNAL_QUERY, which lets you query your Cloud SQL MySQL database and join it with your BigQuery tables.

  1. Set up a replica from your current instance to a Cloud SQL instance; follow this guide.
  2. Understand how Cloud SQL federated queries let you query Cloud SQL instances from BigQuery.

This way you get live access to your relational database.

Example query that you run on BigQuery:

SELECT * FROM EXTERNAL_QUERY(
  'connection_id',
  '''SELECT * FROM mysqltable AS c ORDER BY c.customer_id''');

You can even join a BigQuery table with a Cloud SQL table:

Example:

SELECT c.customer_id, c.name, SUM(t.amount) AS total_revenue,
rq.first_order_date
FROM customers AS c
INNER JOIN transaction_fact AS t ON c.customer_id = t.customer_id
LEFT OUTER JOIN EXTERNAL_QUERY(
  'connection_id',
  '''SELECT customer_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY customer_id''') AS rq ON rq.customer_id = c.customer_id
GROUP BY c.customer_id, c.name, rq.first_order_date;
0
votes

The solution you put forward will allow you to:

  • enforce unique key constraints on the source MySQL database
  • replicate/capture all changes that happen on that database to your data warehouse

That being said, what you end up with in your data warehouse is a log of all the events (inserts, updates and deletes, although deletes are not supported by all SaaS offerings) that have changed your MySQL DB. Hence the "raw" tables in your warehouse will hold multiple events per unique key of your MySQL tables, and you then need to reprocess these events to end up with the same tables as you have in MySQL.

To illustrate this further: at each point in time your MySQL tables are a snapshot, a frozen picture of the state, whereas what you get from binlog replication is the "movie" of all successive state changes of your database. If you want a snapshot in your warehouse, you need to "replay" all the changes up to the point in time for which you want the snapshot.
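As a rough sketch of what "replaying" means, the snippet below rebuilds a table state from a list of change events. The event shape (`ts`, `op`, `id`, `row`) and the function name `replay` are assumptions made up for this illustration; real change data capture tools each emit their own event format.

```python
def replay(events, as_of):
    """Rebuild the table state at time `as_of` from change events."""
    snapshot = {}
    # Apply events in chronological order, stopping at the requested time.
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["ts"] > as_of:
            break
        if event["op"] == "delete":
            snapshot.pop(event["id"], None)
        else:  # insert or update: the latest row wins
            snapshot[event["id"]] = event["row"]
    return snapshot


# Hypothetical event stream for two orders.
events = [
    {"ts": 1, "op": "insert", "id": 10, "row": {"status": "new"}},
    {"ts": 2, "op": "insert", "id": 11, "row": {"status": "new"}},
    {"ts": 3, "op": "update", "id": 10, "row": {"status": "shipped"}},
    {"ts": 4, "op": "delete", "id": 11, "row": None},
]

print(replay(events, as_of=2))  # {10: {'status': 'new'}, 11: {'status': 'new'}}
print(replay(events, as_of=4))  # {10: {'status': 'shipped'}}
```

The same event stream yields a different snapshot depending on the point in time you replay to, which is exactly what a plain copy of the MySQL table cannot give you.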

This is pretty powerful in that you never lose any change happening on your database and can always retrieve it. But it does incur additional work to get your data warehouse tables into the same "snapshot" shape as your source database.

This can generally be done in your warehouse with a CTE that adds row_number() over (partition by id order by updated_at desc) as rn, which you then filter on where rn = 1 and deleted_at is null. Here:

  • id is the column carrying your unique constraint (list several columns if the constraint is composite, i.e. on multiple keys)
  • updated_at is the timestamp of each change data capture event
  • deleted_at is the timestamp of the delete event (or null if no delete event has happened for a given key)
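Here is a minimal, runnable sketch of that deduplication query, using an in-memory SQLite database as a stand-in for the warehouse (this assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The table `raw_orders` and its `status` column are hypothetical; `id`, `updated_at` and `deleted_at` play the roles described above.

```python
import sqlite3

# Keep only the most recent event per id, then drop keys whose latest event is a delete.
LATEST_STATE_SQL = """
WITH ranked AS (
    SELECT
        id, status, updated_at, deleted_at,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM raw_orders
)
SELECT id, status
FROM ranked
WHERE rn = 1 AND deleted_at IS NULL
ORDER BY id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, status TEXT, updated_at TEXT, deleted_at TEXT)"
)
# Hypothetical change events: several rows per id, as binlog replication would produce.
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "new", "2024-01-01", None),
        (1, "shipped", "2024-01-02", None),
        (2, "new", "2024-01-01", None),
        (2, "new", "2024-01-03", "2024-01-03"),  # delete event for id 2
        (3, "new", "2024-01-02", None),
    ],
)

current_rows = conn.execute(LATEST_STATE_SQL).fetchall()
print(current_rows)  # [(1, 'shipped'), (3, 'new')]
```

In a real warehouse you would typically wrap the query in a view or a scheduled materialisation, so that downstream fact and dimension tables only ever see one row per key.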

For open-source, self-hosted change data capture, you can also look into tools like Debezium, which runs on Kafka Connect (or AWS Kinesis or others...), if that's infrastructure your client would be willing to invest in. Or just look at the logical replication connections offered by the database library of your language of choice for your preferred DB (e.g. I use psycopg2 (with extras) for PostgreSQL in Python).