I have a database in Google BigQuery with millions of rows (more than 2 million new rows every day) containing my user activities.
I created a PHP program to get insights from this database, with many queries that show statistics such as data per day, per hour, and more.
I have two cases, each with a problem:
I try to find the user activity data for dates between 2016-11-01 and 2016-11-10, and then I need to break that data down to 2016-11-05 only (this data is basically a subset of the first query's result). The data needs to be classified per day, per hour, per user type, and so on. Right now I use many database queries to group the data and do the various manipulations. For example I run:

    SELECT date, hour, COUNT(*) AS total
    FROM user_activities
    WHERE date >= '2016-11-01' AND date <= '2016-11-10'
    GROUP BY date, hour

and then, when I need to break down the 2016-11-05 data only, I re-run the query:

    SELECT date, hour, COUNT(*) AS total
    FROM user_activities
    WHERE date = '2016-11-05'
    GROUP BY date, hour
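For context, this is roughly how my PHP program issues those queries (a minimal sketch assuming a recent google/cloud-bigquery client; the project ID is a placeholder):

    <?php
    require 'vendor/autoload.php';

    use Google\Cloud\BigQuery\BigQueryClient;

    // Placeholder project ID; the real one comes from my config.
    $bigQuery = new BigQueryClient(['projectId' => 'my-project']);

    // Run a query and collect all result rows into an array.
    function runGroupedQuery(BigQueryClient $bigQuery, string $sql): array
    {
        $job = $bigQuery->query($sql);        // build the query job configuration
        $results = $bigQuery->runQuery($job); // execute and wait for the results
        $rows = [];
        foreach ($results->rows() as $row) {
            $rows[] = $row;                   // each $row is an associative array
        }
        return $rows;
    }

    // The full date range first...
    $range = runGroupedQuery($bigQuery,
        "SELECT date, hour, COUNT(*) AS total
         FROM user_activities
         WHERE date >= '2016-11-01' AND date <= '2016-11-10'
         GROUP BY date, hour");

    // ...and then a second, almost identical query for the single day.
    $oneDay = runGroupedQuery($bigQuery,
        "SELECT date, hour, COUNT(*) AS total
         FROM user_activities
         WHERE date = '2016-11-05'
         GROUP BY date, hour");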
Or sometimes I need to query the data with a different parameter, for example the user activities between 2016-11-01 and 2016-11-10 that contain activity "A", and then I need to switch to activity "B". I have a column that identifies the type of activity a user performs. Right now I run a query like "SELECT * FROM user_activities WHERE activities LIKE 'A'", and when the activity type changes I run a new query: "SELECT * FROM user_activities WHERE activities LIKE 'B'".
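Between the two runs only the activity value changes, so in my PHP program the pattern looks roughly like this (a sketch reusing the client from above and assuming its named query parameters, which need standard SQL):

    // Only the activity parameter differs between runs.
    $job = $bigQuery->query(
        "SELECT *
         FROM user_activities
         WHERE date >= '2016-11-01' AND date <= '2016-11-10'
           AND activities LIKE @activity"
    )->parameters(['activity' => 'A']); // re-run later with 'B'

    $results = $bigQuery->runQuery($job);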
So my question is: because the data in my database is so big, and because my PHP program queries it at such high frequency, the cost of data management and processing has become very high. For cases like 1 and 2, is there an alternative solution, such as PHP caching, that would reduce the number of database requests?
In just 1-2 days my BigQuery requests can scan terabytes of data, which I'm afraid is not efficient in terms of database cost.
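To make the question concrete, the kind of caching I have in mind is something like a result cache keyed by the query text (a minimal sketch using a plain file cache; the TTL and the temp-dir location are placeholder choices):

    // Cache query results on disk, keyed by a hash of the SQL text, so an
    // identical query within the TTL never reaches BigQuery again.
    function cachedQuery(BigQueryClient $bigQuery, string $sql, int $ttl = 3600): array
    {
        $cacheFile = sys_get_temp_dir() . '/bq_' . md5($sql) . '.json';

        // Serve from cache while it is fresh enough.
        if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
            return json_decode(file_get_contents($cacheFile), true);
        }

        $rows = [];
        foreach ($bigQuery->runQuery($bigQuery->query($sql))->rows() as $row) {
            $rows[] = $row;
        }
        file_put_contents($cacheFile, json_encode($rows));
        return $rows;
    }

I imagine a shared store such as Redis or APCu could play the same role as the file cache here; the point is only to avoid re-running identical queries.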
So far I have tried these solutions:
- I take the raw data from my database, cache it in PHP, and run the data manipulation manually. For example I run "SELECT * FROM user_activities WHERE date >= '2016-11-01' AND date <= '2016-11-10'" and then group by hour, user type, or user activity manually and sequentially in PHP functions (roughly the first sketch below). But because the data contains millions of rows, this process is very slow and inefficient.
- I take the raw data from my database, insert it into a temporary table, and then manipulate the data with queries against that temporary table (roughly the second sketch below). But this is not efficient either, because inserting millions of rows takes far too long.
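This is roughly the manual grouping from the first attempt (a sketch; the column names follow my queries above, and cachedQuery is the helper from the caching sketch):

    // First attempt, sketched: pull the raw rows once and group them in PHP.
    $rows = cachedQuery($bigQuery,
        "SELECT date, hour, user_type
         FROM user_activities
         WHERE date >= '2016-11-01' AND date <= '2016-11-10'");

    $perHour = [];
    foreach ($rows as $row) {
        $key = $row['date'] . ' ' . $row['hour'];   // e.g. "2016-11-05 13"
        $perHour[$key] = ($perHour[$key] ?? 0) + 1; // activity count per hour
    }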
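And the second attempt looked roughly like this (a sketch using the client's insertRows streaming insert; the dataset and table names are placeholders), where the insert loop is the part that takes too long:

    // Second attempt, sketched: copy the raw rows into a temporary table and
    // run the breakdown queries against that table instead.
    $table = $bigQuery->dataset('my_dataset')->table('tmp_user_activities');

    foreach (array_chunk($rows, 500) as $chunk) {
        $payload = [];
        foreach ($chunk as $row) {
            $payload[] = ['data' => $row]; // insertRows expects a 'data' key per row
        }
        $table->insertRows($payload);      // streaming insert, batch by batch
    }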
Do you have any suggestions for how I can optimize this?