2
votes

I have the following schema in MySQL (simplified for this question; in reality it contains more tables than shown here):

User: id, email, first_name, last_name, gender, birthday and 30 more such columns

Account: id, user_id, total_visits, total_credits, total_redemptions, total_debits, points, initial_credit, initial_debit & 20 more such columns

Checkin: id, user_id, location_id, approved, amount, number, checkin_date, status, qr_code, barcode, points_earned and 30 more such columns.

Here:

  1. id - the primary key (integer).
  2. <table>_id - a foreign key; for example, user_id in Account points to the id column in the User table. (A rough sketch of these tables follows below.)
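
Roughly, the simplified tables look something like the DDL below (the column types are only indicative and most columns are omitted):

    -- Indicative sketch of the simplified schema; the real tables have many more columns
    CREATE TABLE User (
      id         INT PRIMARY KEY AUTO_INCREMENT,
      email      VARCHAR(255),
      first_name VARCHAR(100),
      last_name  VARCHAR(100),
      gender     VARCHAR(10),
      birthday   DATE
      -- ...and ~30 more columns
    );

    CREATE TABLE Account (
      id            INT PRIMARY KEY AUTO_INCREMENT,
      user_id       INT,
      total_visits  INT,
      total_credits INT,
      points        INT,
      -- ...and ~20 more columns
      FOREIGN KEY (user_id) REFERENCES User(id)
    );

    CREATE TABLE Checkin (
      id            INT PRIMARY KEY AUTO_INCREMENT,
      user_id       INT,
      location_id   INT,
      checkin_date  DATETIME,
      points_earned INT,
      -- ...and ~30 more columns
      FOREIGN KEY (user_id) REFERENCES User(id)
    );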

To import this, the advice in the docs is:

In BigQuery, you typically want to denormalize the data structure in order to enable super-fast querying. While JOINs on small datasets are possible with BigQuery, they're not as performant as a denormalized structure. Some type of normalization is possible with the nested/repeated functionality.

If I understand this correctly, does that mean:

  1. there will be only one table, User, with 100+ columns (data from all these tables: Account, Checkin, etc.)?
  2. there will be one User table and one Events table? The User table would have pretty much the same schema it currently has in MySQL, and the Events table would store the actual Checkin and Account data.
  3. some other type of schema, e.g. one using the nested/repeated fields mentioned above (a sketch follows this list)?
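
For example, I imagine option #3 could use the nested/repeated functionality mentioned in the quote, something like the sketch below (the dataset/table name and which columns to nest are just my guesses):

    -- Hypothetical BigQuery table: one row per user, account fields inlined,
    -- and each user's checkins stored as a repeated nested record.
    CREATE TABLE mydataset.users_denormalized (
      user_id       INT64,
      email         STRING,
      first_name    STRING,
      last_name     STRING,
      -- account columns folded into the user row
      total_visits  INT64,
      points        INT64,
      -- checkins as a repeated STRUCT (nested/repeated field)
      checkins ARRAY<STRUCT<
        checkin_id    INT64,
        location_id   INT64,
        checkin_date  TIMESTAMP,
        points_earned INT64
      >>
    );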

Further, where can I find more resources that go in depth on denormalizing MySQL tables for BigQuery?


2 Answers

4
votes

When designing a schema in BigQuery, it is important to look at table statistics. BigQuery has two major implementations of its JOIN algorithm: one which is very fast but only scales up to a few MBs, and another which can scale to any size but is slower. Let's take the User table. If you are dealing with tens of millions of users, this table will likely be more than 10 MBs; but if you have tens of thousands of users, it will be well below that limit. In that case you can leave it as a separate table without sacrificing performance. So if the numbers work out, I would recommend something similar to approach #2: one User table (small) and one Events table (huge).
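
As a rough illustration of that layout (the dataset, table, and column names below are placeholders, not anything prescribed by BigQuery), a query over a small users table joined to a huge events table could look like this in standard SQL:

    -- Small `users` dimension joined to a huge `events` fact table.
    -- With a small `users` table, the fast join implementation described above can apply.
    SELECT
      u.gender,
      COUNT(*)             AS checkins,
      SUM(e.points_earned) AS total_points
    FROM mydataset.events AS e
    JOIN mydataset.users  AS u
      ON u.id = e.user_id
    WHERE e.event_type = 'checkin'
    GROUP BY u.gender;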

0
votes

This is a common need when building a database for reporting purposes. Generally we prefer normalized schemas for fast writes, low disk space, and data integrity, but for reporting we like highly aggregated, heavily denormalized schemas so that a single table read is all that is needed.

I would work towards a single table if possible. Go to your lowest level of granularity, likely your checkin.id, and join to your other tables from there, grabbing only the fields you will need in BigQuery (as sketched below).
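
For example, a flattening query at checkin granularity might look something like this (the selected columns are just placeholders for whichever fields you actually need):

    -- One output row per checkin, with the user and account fields you need
    -- duplicated onto it; load the result into a single wide BigQuery table.
    SELECT
      c.id            AS checkin_id,
      c.location_id,
      c.checkin_date,
      c.points_earned,
      u.id            AS user_id,
      u.email,
      u.gender,
      u.birthday,
      a.total_visits,
      a.points        AS account_points
    FROM Checkin c
    JOIN User    u ON u.id = c.user_id
    JOIN Account a ON a.user_id = u.id;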

As for the number of columns, I wouldn't worry about it too much. We've built single-object datastores in SAP BW that were denormalized down to the invoice line, containing point-in-time customer information, company hierarchies, material/SKU attributes, and dates denormalized into months, quarters, years, and fiscal periods. In the end we often have over 200 columns. It's much, much faster than joining live at query runtime across a more normalized schema. In fact, a query against the normalized schema probably wouldn't even return.

It feels wrong, but when your primary goal is fast data retrieval rather than saving disk space, avoiding duplicated data, and all the other things we worry about when building a front end, fully denormalized data is something to aim for.