0
votes

I want to ensure that I don't negatively impact query performance based on schema design for BigQuery. I have two tables which I need to perform a join. The column that I will use to join the tables could be of type INTEGER or STRING. STRING would be easier in my case as it wouldn't require any new validation within our code base to ensure all values are of type INTEGER. But I don't want to join on a type STRING, if query performance will be significantly worse than running the join on an INTEGER type column.

Is there a large performance difference in BigQuery when the join is on type STRING vs type INTEGER?

---Update 10/16---

I ran some basic analysis to test this, here are the results:

Using public dataset, users table has 10M rows and posts table has 31M rows

Join on Integer: 2.78 sec elapsed, 318.1 MB processed (avg over 10 runs)

Join on String 6.77 sec elapsed, 137 MB processed (avg over 10 runs)

-- Join on Integer Query

SELECT count(*)
FROM `bigquery-public-data.stackoverflow.users` u
JOIN `bigquery-public-data.stackoverflow.stackoverflow_posts` p
  on u.id = p.owner_user_id
WHERE RAND() < 2

(Where clause added to avoid cache)

-- Join on String

SELECT count(*)
FROM 'bigquery-public-data.stackoverflow.users' u
JOIN 'bigquery-public-data.stackoverflow.stackoverflow_posts' p
  on u.display_name = p.owner_display_name
WHERE RAND() < 2

(Where clause added to avoid cache)


Surprisingly, JOIN on STRING appears to perform worse than INTEGER.

1

1 Answers

4
votes

No, you won't see any significant difference. Go with the schema that is more natural for your use case.