1
votes

I'm using Google BigQuery to store the user data activities at my app.

I created a table in BigQuery, lets call it "user_activities_data".

This data including many data user activities that always be updated every second. Every second there will be thousands of new data inserted to the table.

I do a lot of query to this table.

I know that BigQuery is charged by how much data that you use for every query you do. I already try to optimize my query to use as low as possible using the data size (by selecting only needed column).

For example:

SELECT username,activity FROM user_activities_data WHERE date_activities>='2016-10-01' and date_activities<='2016-10-31'

But because every day my number of data always increasing, the pricing become increasing too. Even though I use the same query with the same where limitation (like the example of query above).

So my question is what is the most optimal way to optimize the pricing?

Last month for each query I will be charged around $0.2, and this month because the amount of my data has been doubled I got bill $0.4 per query. I did around hundreds of queries every day so the pricing can be so huge if not managed well.

I have several options:

  1. Use daily partition option at BigQuery, but I don't know if I did it correctly the pricing (by number of data processed) is the same. (Did I do something wrong at this?)

  2. Divide the table into many tables per month, like: user_data_activities_oct16, user_data_activities_nov16 and more. Note: Some times I need to query data between months like data of user activities from August 31th 2016 to November 5th 2016, is this a good solution?

I open for any suggest :)

Thank you very much

1

1 Answers

1
votes

You need to use Partitioned Tables.

You need to create a table that is type partitioned by day for example. And then you can write into each day data like this mydataset.table$20160519, and you will query like this:

SELECT
  field1
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")

as you see the pseudo column _PARTITIONTIME is the way to restrict the query to touch only a partition. This means that query price will be calculated only to the data that is touched during the mentioned date interval, and not to the whole table. This also doesn't stop you to do your full table queries.