1
votes

The _TABLE_SUFFIX feature is great and exactly what I was looking for to solve my problem - however it is scanning all of the data matched by the wildcard when I use a sub-query to determine which tables to match on.

If you do an operation such as = or BETWEEN or IN with a set of values on _TABLE_SUFFIX, you can see the amount of data being scanned goes down compared to simply a wildcard:

SELECT sample_data FROM `test.dataset.*` WHERE _TABLE_SUFFIX IN ("NWD1","NWD2","NWD3","NWD4","NWD5") - 1.8 GB scanned

However if I do the following:

SELECT sample_data FROM `test.dataset.*` WHERE _TABLE_SUFFIX IN (SELECT ID FROM subset) - 50GB scanned (this sub-select contains the same values as shown in the explicit IN clause)

3
Filters on _TABLE_SUFFIX that include subqueries cannot be used to limit the number of tables scanned for a wildcard table.Mikhail Berlyant
ok are there plans to support this? - to be clear, it works, but it doesn't reduce scanned bytesAlexander Baumann

3 Answers

3
votes

Constant filters on _TABLE_SUFFIX will reduce the amount of data queried, but not if those filters are coming from a dynamic subquery.

As an alternative - have you considered clustered storage?

Clustered tables are able to optimize the amount of data queried, even when filtering with a dynamic subquery.

SELECT sample_data FROM `test.dataset.*`
  WHERE clustered_column IN (SELECT ID FROM subset)

would work.

For example

SELECT MAX(title), wiki
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` 
WHERE DATE(datehour) = '2018-01-10'
AND wiki IN (
  SELECT wiki 
  FROM  `fh-bigquery.wikipedia_v3.pageviews_2018`   
  WHERE  DATE(datehour) = '2018-01-01'
  AND NOT wiki LIKE 'e%'
  LIMIT 3
)
GROUP BY 2 

queries 0.341 GB, instead of 10 GB

1
votes

Maybe the best you can do is to generate IN clause with a query like below then form another query with generated IN clause:

select concat('IN ("', string_agg(ID, '", "'), '")') as in_clause 
from subset;
1
votes

It is not about the subqueries per se - if the limiting data sits in the query it is all fine - even when using subqueries:

CREATE TEMP FUNCTION daterange(suf string) as ( suf between '04' and '06');

WITH vars AS (
  SELECT
    '04' as startDate,
    '06' as endDate,
    ['04', '05', '06'] as daterange
  )

SELECT 
  *
FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_201611*` t -- 513.6
WHERE 
  _TABLE_SUFFIX between '04' and '06' -- 39.4
  --(select _table_suffix between startDate and endDate from vars) -- 39.4
  --daterange(_table_suffix) -- 39.4
  --_table_suffix in unnest( (select daterange from vars) ) -- 39.4

hth!