0
votes

I have a fact table that joins to a dimension table, and the dimension table has 16 million records. To optimize the join, is it a good idea to partition the dimension table on the SK field using BigQuery integer range partitioning?

What is the best way to join to this dimension efficiently, given that it has 16 million records?

Thanks

1
What query time are you currently seeing that you consider slow, and what would you consider efficient? - Pentium10

1 Answer

0
votes

I'd recommend clustering rather than partitioning, especially since you haven't indicated the range of the IDs or how it will change over time.
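As a sketch of what that could look like for your case (dataset, table, and column names here are hypothetical; I'm assuming your surrogate key column is called sk):

```sql
-- Build a clustered copy of the dimension, clustered on the surrogate key.
-- BigQuery will co-locate rows with nearby sk values, which can reduce
-- the data scanned when joining or filtering on sk.
CREATE TABLE mydataset.dim_clustered
CLUSTER BY sk
AS
SELECT * FROM mydataset.dim;
```

Note that no partitioning column is needed for this: BigQuery supports clustering a table without partitioning it.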

However, I tested one query pattern with 20 million records, and at this scale there was no advantage to clustering versus doing nothing:

CREATE TABLE temp.lookup_clustered
PARTITION BY fake_date 
CLUSTER BY id
AS
SELECT FARM_FINGERPRINT(FORMAT('%t%t%t',date, wban,stn)) id, *
FROM `fh-bigquery.weather_gsod.all` 
WHERE name<'C'
;
CREATE TABLE temp.lookup_plain
AS
SELECT FARM_FINGERPRINT(FORMAT('%t%t%t',date, wban,stn)) id, *
FROM `fh-bigquery.weather_gsod.all` 
WHERE name<'C'
;
CREATE TABLE temp.ids AS 
SELECT id FROM temp.lookup_plain
;

SELECT MAX(temp) 
FROM (SELECT id FROM temp.ids LIMIT 1000 )
JOIN `temp.lookup_clustered`
USING(id) 
# 2.1 sec elapsed, 440.2 MB processed
# Slot time consumed 32.846 sec
# Bytes shuffled 26.51 KB
;

SELECT MAX(temp) 
FROM (SELECT id FROM temp.ids LIMIT 1000 )
JOIN `temp.lookup_plain`
USING(id) 
# 1.8 sec elapsed, 440.2 MB processed
# Slot time consumed 34.740 sec
# Bytes shuffled 26.39 KB

Use a similar script to test the best strategy for your own use cases (which are missing from the question). And please report back with the results!