
I have a database in Google BigQuery with millions of rows (more than 2 million new rows every day) containing my user activities.

I created a PHP program to get insights from this database by running many queries that show statistics per day, per hour, and so on.

I have two cases with two problems:

  1. I need the user activity data between 2016-11-01 and 2016-11-10, and then I need to break down the data for 2016-11-05 only (that data is basically a subset of the first query's result). The data needs to be classified per day, per hour, per user type, and so on. Right now I use many queries against the database to group the data and do the data manipulation. For example I run "SELECT * FROM user_activities WHERE date>='2016-11-01' AND date<='2016-11-10' GROUP BY date,hour", and then when I need to break down the data for 2016-11-05 only I re-run the query: "SELECT * FROM user_activities WHERE date='2016-11-05' GROUP BY date,hour".

  2. Or sometimes I need to query the data with a different parameter, for example the user activities between 2016-11-01 and 2016-11-10 that contain activity "A", and then I need to change it to activity "B". I have a column that identifies the type of activity a user does. Right now I run a query like "SELECT * FROM user_activities WHERE activities LIKE 'A'", and then when the activity type changes I run a new query "SELECT * FROM user_activities WHERE activities LIKE 'B'".

So my question is: because the data in my database is so big, and because my PHP program requests these insight queries so frequently, the cost of data management and processing has become very high. For cases like 1 and 2, is there any alternative solution, such as caching in PHP, to reduce the number of database requests?

In just 1-2 days my BigQuery requests can scan terabytes of data. I'm afraid this is not very efficient in terms of database cost management.

So far I have tried these solutions:

  1. I take the raw data from my database, cache it in PHP, and run the data manipulation manually. For example I run "SELECT * FROM user_activities WHERE date>='2016-11-01' AND date<='2016-11-10'" and then try to do the manipulation, like grouping by hour, by user type, or by user activity, manually and sequentially in PHP functions. But because my data contains millions of rows, the process becomes very slow and inefficient.
  2. I take the raw data from my database, insert it into a temporary table, and then manipulate the data by querying the temporary table. But this process is not efficient either, because inserting millions of rows takes very long.

Do you have any suggestions on how I can optimize this?

I remember I recommended partitioned tables to you; did you implement that? Can you also post the costs and size of the queries you run for one day, and how many queries you run per day? – Pentium10
Hi @Pentium10, yes I have implemented your suggestion of partitioning per day, and it works very well to reduce the cost :) But I have a new problem with redundant query requests that differ only in small parameters. Per insight request (with dozens of queries) the cost becomes almost 1 TB (this is already much better than before I implemented the daily partitioning, when an insight could cost multiple TB of data). And I need to call the insight multiple times per day. I'm just curious whether there is any more advanced optimization I can do :) – hum_hum_pa
Show us the schema of the tables, and produce a chart of the number of records per day so we can see whether you need to partition differently. What is the current cost you have reached? – Pentium10

1 Answer

  1. Implement the partitioned tables as has been recommended to you.
    If you have one single big table with 5 TB of data and no partitioning, your costs are high.
    With partitioned tables you query only the storage for the days you need, not the whole table. Just a fraction of it, like 10 GB or smaller, and you pay only for that. (A minimal sketch of such a query is shown after this list.)

  2. You can save a query result directly into a table instead of re-importing it as you describe, and then query only that smaller table for further aggregation (see the second sketch after this list).

  3. Try not to use 'SELECT *'; instead select only the columns you must have in your output.
  4. If the data is small enough and you run lots of small queries on it, you may want to take it out of BQ and store it in ElasticSearch or MySQL and run the queries from there.
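
For point 1, here is a minimal sketch of what a query against a day-partitioned table can look like, assuming an ingestion-time partitioned table and placeholder names (myproject.mydataset.user_activities, event_time and user_type are assumptions, not your actual schema):

    -- Only the partitions for 2016-11-01 through 2016-11-10 are scanned and billed,
    -- not the whole table. Column names are placeholders for your own schema.
    SELECT
      DATE(_PARTITIONTIME) AS day,
      EXTRACT(HOUR FROM event_time) AS hour,
      user_type,
      COUNT(*) AS activity_count
    FROM `myproject.mydataset.user_activities`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-10')
    GROUP BY day, hour, user_type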
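
For point 2, a sketch of saving the 10-day result into its own table once and running the breakdowns against that small table. It assumes the date, hour, user_type and activities columns mentioned in your queries; the destination table name is made up. CREATE OR REPLACE TABLE ... AS needs standard SQL DDL; alternatively you can set a destination table on the query job for the same effect.

    -- Materialize the 10-day window once (table and column names are assumptions).
    CREATE OR REPLACE TABLE `myproject.mydataset.insight_20161101_20161110` AS
    SELECT date, hour, user_type, activities, COUNT(*) AS activity_count
    FROM `myproject.mydataset.user_activities`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-10')
    GROUP BY date, hour, user_type, activities;

    -- Later breakdowns scan only this small table, not the raw data.
    SELECT hour, user_type, SUM(activity_count) AS activity_count
    FROM `myproject.mydataset.insight_20161101_20161110`
    WHERE date = DATE '2016-11-05'
    GROUP BY hour, user_type;

Note that both sketches also select only the columns needed for the output, which is what point 3 is about.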