4
votes

I want to log events from my client component and analyze them in Google's BigQuery. My problem is that the events come in several different types (with the potential for more types to be added in the future), and each event type has a different number and set of properties.

For example:

{"event":"action",
    "properties":{"ts":1384441115,
          "distinct_id":"5EB54670",
          "action_type":"pause",
          "time":"5"}}

{"event":"action",
    "properties":{"ts":1384441115,
          "distinct_id":"5EB54670",
          "action_type":"resume",
          "time":"15"}}

{"event":"section",
    "properties":{"ts":1384441115,
          "distinct_id":"5EB54670",
          "section_name":"end",
          "dl_speed":"0.5 Mbit/s",
          "time":"25"}}

My question is - how do I handle this diversity in a tabular DB? My reason for choosing BigQuery is its ability to handle big-data calculation and analysis of my logged events, but for that to happen I need to figure out the best practice for logging these events.

I thought about 2 options:
1. Have one large table with columns for every property of every event type - in this case every row will contain empty fields.
2. Have a separate table for each event type - this raises two issues: future event types will call for new tables, and even worse, I lose the ability to perform calculations over all events (seeing as all events share some properties like ts, distinct_id and time).

I'm pretty sure I'm not reinventing the wheel with my use case, so I would love to hear about the best practices from you guys. Thanks!

Amit

1 Answer

4
votes

You have a number of options:

  1. Use a wide schema. You can have a column for every property, and you can add columns to the table with the tables.update() method. While it may seem inefficient to have a lot of null columns, this is actually the most efficient way to store and query your data (see the first sketch query after this list).

    Null values don't cost anything to store (e.g. if you have a table with a million rows and a column that has a value in only 10 of them, you only get charged for storing those 10 values). Even better, null values don't cost anything to query either. A wide table schema also makes your queries less expensive, since you won't be reading all of the properties on each query, just the columns that you care about.

  2. Store the properties in a repeated field as key-value pairs. In that case, you'll likely need a keyword that we haven't yet documented -- OMIT ... IF. This is a pretty clean way of doing it; you'd end up with queries that look like:

    SELECT properties.value FROM my_table
    OMIT properties IF properties.name <> "dl_speed"
    

    Of course, some queries could get pretty awkward in this scenario.

  3. Store the properties in a JSON field, and extract the field names you need in the query. We've recently added a couple of functions that will make this easy and efficient; however, they haven't quite made it to production yet. I'll try to remember to update this answer when they go live, which will hopefully be today, but release schedules in December can be unpredictable. (There's a sketch of what this could look like after this list.)

  4. I'd recommend against having a separate table to join against. While that's the common way to do things in the relational-database world, it's going to be less efficient in BigQuery. We usually recommend that you denormalize your data.
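
To make option 1 concrete, here's a minimal sketch of a query against a wide schema. The table name my_table and the column names (taken from the sample events in the question) are just illustrative assumptions:

    -- Wide schema: one column per property; rows of other event types
    -- simply hold NULL in the columns they don't use. Only the columns
    -- referenced here (event, action_type) are read, so the NULL-heavy
    -- property columns add nothing to the query cost.
    SELECT event, action_type, COUNT(*) AS event_count
    FROM my_table
    GROUP BY event, action_type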
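
And a sketch of option 3, assuming (my assumption, since the functions aren't live yet) that they ship as JSON extraction functions along the lines of JSON_EXTRACT / JSON_EXTRACT_SCALAR, with the whole properties object stored as a JSON string in a single properties column alongside an event column:

    -- JSON column: individual fields are pulled out of the JSON string
    -- at query time with a JSONPath expression.
    SELECT JSON_EXTRACT_SCALAR(properties, '$.dl_speed') AS dl_speed
    FROM my_table
    WHERE event = 'section'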