Need suggestion on splitting table in bigquery based on non-date column along with date partition

Question

We are having a date partitioned table with 5 yrs(with daily incremental load) data running into millions & millions of records. To improve the performance, thinking of splitting the table based on a non-date field(id) as all the queries will include a where clause on that column(id). And also partition each of split tables with date partition so that we can query on a smaller dataset with a date range. we will not be using wildcarded table as we will know the id and planning to append that to the table and run a query against that specific table. Need to know whether that would be good option to pursue to improve performance and reduce the query cost.

[Update]: We went ahead and split the tables based on id column(tablename_id) and made the table date partitioned and clustered with 4 other columns(max supported) which are commonly used in queries. With that we were able to get a better performance and also reduced the data accessed for each query. Based on the testing looks like it is a good option to puruse as long as wildcarded querying of tables is avoided and till Bigquery supports partitioning based on non-date/non-datetime columns.

Thanks @ElliottBrossard. We already had our tables clustered with 4 other columns. we tried the split table approach based on id and partitioned by date along with clustering. The split table approach brought down the time taken for executing the queries as well reduced the amount of data processed to return the results. seems like good option to take till Bigquery supports non-date/non-datetime column based partitioning. — Manoj K J
That's great! I see that you updated the question, but consider adding your findings as an answer instead to help people find it in the future. Thanks! — Elliott Brossard
@ElliottBrossard I have added the approach and the stats for the same as an answer. — Manoj K J

Manoj K J Manoj K J · Accepted Answer · 2019-08-07T13:31:08

We split the tables based on the id columns creating multiple tables. Each of the split tables are partition on date column. Apart from that we had it as clustered table on 4 other columns as needed. Find below the performance on a sample dataset. Old Table(UserInfo) has more than 500,000 rows. The stats we captured is for a given date range and id, the performance of old table(non split/combined table) and split table(split based on the ID) in terms of amount of data processed and the time taken for the same query.

Need suggestion on splitting table in bigquery based on non-date column along with date partition

2 Answers