2
votes

I have the following schema in MySQL (simplified for this question; in reality it contains more tables than shown here):

User: id, email, first_name, last_name, gender, birthday and 30 more such columns

Account: id, user_id, total_visits, total_credits, total_redemptions, total_debits, points, initial_credit, initial_debit & 20 more such columns

Checkin: id, user_id, location_id, approved, amount, number, checkin_date, status, qr_code, barcode, points_earned and 30 more such columns.

Here:

  1. id - the primary key (integer).
  2. <table>_id - a foreign key; for example, user_id in Account points to the id column in the User table. (A rough sketch of these tables follows below.)
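
Roughly, the simplified tables look something like the DDL below (the column types are only indicative and most columns are omitted):

    -- Indicative sketch of the simplified schema; the real tables have many more columns
    CREATE TABLE User (
      id         INT PRIMARY KEY AUTO_INCREMENT,
      email      VARCHAR(255),
      first_name VARCHAR(100),
      last_name  VARCHAR(100),
      gender     VARCHAR(10),
      birthday   DATE
      -- ...and ~30 more columns
    );

    CREATE TABLE Account (
      id            INT PRIMARY KEY AUTO_INCREMENT,
      user_id       INT,
      total_visits  INT,
      total_credits INT,
      points        INT,
      -- ...and ~20 more columns
      FOREIGN KEY (user_id) REFERENCES User(id)
    );

    CREATE TABLE Checkin (
      id            INT PRIMARY KEY AUTO_INCREMENT,
      user_id       INT,
      location_id   INT,
      checkin_date  DATETIME,
      points_earned INT,
      -- ...and ~30 more columns
      FOREIGN KEY (user_id) REFERENCES User(id)
    );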

To import this, the advice in the docs is:

In BigQuery, you typically want to denormalize the data structure in order to enable super-fast querying. While JOINs on small datasets are possible with BigQuery, they're not as performant as a denormalized structure. Some type of normalization is possible with the nested/repeated functionality.

If I understand this correctly, does that mean:

  1. there will be only one table, User, with 100+ columns (data from all these tables: Account, Checkin, etc.)?
  2. there will be one User table and one Events table? The User table would have pretty much the same schema it currently has in MySQL, and the Events table would store the actual Checkin and Account data.
  3. some other type of schema, e.g. one using the nested/repeated fields mentioned above (a sketch follows this list)?
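
For example, I imagine option #3 could use the nested/repeated functionality mentioned in the quote, something like the sketch below (the dataset/table name and which columns to nest are just my guesses):

    -- Hypothetical BigQuery table: one row per user, account fields inlined,
    -- and each user's checkins stored as a repeated nested record.
    CREATE TABLE mydataset.users_denormalized (
      user_id       INT64,
      email         STRING,
      first_name    STRING,
      last_name     STRING,
      -- account columns folded into the user row
      total_visits  INT64,
      points        INT64,
      -- checkins as a repeated STRUCT (nested/repeated field)
      checkins ARRAY<STRUCT<
        checkin_id    INT64,
        location_id   INT64,
        checkin_date  TIMESTAMP,
        points_earned INT64
      >>
    );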

Further, where can I find more resources that go in depth on denormalizing MySQL tables for BigQuery?


2 Answers

4
votes

When designing a schema in BigQuery, it is important to look at table statistics. BigQuery has two major implementations of its JOIN algorithm: one which is very fast but only scales up to a few MBs, and another which can scale to any size but is slower. Let's take the User table. If you are dealing with tens of millions of users, this table will likely be more than 10 MBs; but if you have tens of thousands of users, it will be well below that limit. In that case you can leave it as a separate table without sacrificing performance. So if the numbers work out, I would recommend something similar to approach #2: one User table (small) and one Events table (huge).
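
As a rough illustration of that layout (the dataset, table, and column names below are placeholders, not anything prescribed by BigQuery), a query over a small users table joined to a huge events table could look like this in standard SQL:

    -- Small `users` dimension joined to a huge `events` fact table.
    -- With a small `users` table, the fast join implementation described above can apply.
    SELECT
      u.gender,
      COUNT(*)             AS checkins,
      SUM(e.points_earned) AS total_points
    FROM mydataset.events AS e
    JOIN mydataset.users  AS u
      ON u.id = e.user_id
    WHERE e.event_type = 'checkin'
    GROUP BY u.gender;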

0
votes

This is a common need when building a database for reporting purposes. Generally we prefer normalized schemas for fast writes, low disk space, and data integrity, but for reporting we like highly aggregated, heavily denormalized schemas so that a single table read is all that is needed.

I would work towards a single table if possible. Go to your lowest level of granularity, likely your checkin.id, and join to your other tables from there, grabbing only the fields you will need in BigQuery (as sketched below).
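
For example, a flattening query at checkin granularity might look something like this (the selected columns are just placeholders for whichever fields you actually need):

    -- One output row per checkin, with the user and account fields you need
    -- duplicated onto it; load the result into a single wide BigQuery table.
    SELECT
      c.id            AS checkin_id,
      c.location_id,
      c.checkin_date,
      c.points_earned,
      u.id            AS user_id,
      u.email,
      u.gender,
      u.birthday,
      a.total_visits,
      a.points        AS account_points
    FROM Checkin c
    JOIN User    u ON u.id = c.user_id
    JOIN Account a ON a.user_id = u.id;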

As for the number of columns, I wouldn't worry about it too much. We've built single-object datastores in SAP BW that were denormalized down to the invoice line, containing point-in-time customer information, company hierarchies, material/SKU attributes, and dates denormalized into months, quarters, years, and fiscal periods. In the end we often have over 200 columns. It's much, much faster than joining live at query runtime across a more normalized schema. In fact, a query against the normalized schema probably wouldn't even return.

It feels wrong, but when your primary goal is fast data retrieval rather than saving disk space, avoiding duplicated data, and all the other things we worry about when building a front end, fully denormalized data is something to aim for.