0
votes

Say I have tables in BigQuery with the format table_name_YYYYMMDD. I know in standard SQL you can use wildcards with _TABLE_SUFFIX to select a range of tables in a query like this:

SELECT
  user_id
FROM `project.dataset.table_name_*`
WHERE _TABLE_SUFFIX BETWEEN '20200520' AND '20200525'

My question is whether there is a way to tell the number of tables that were actually selected in the end (i.e. 20200520, 20200521, 20200522, 20200523, 20200524, 20200525 would be 6 tables).

In my case, I have a query that uses 30 tables to create a monthly summary, but sometimes one or more of those 30 tables might not exist, in which case I would like to throw an error instead of silently querying less than the 30 tables I expect. (I know this can be done in Legacy SQL with TABLE_DATE_RANGE but I was wondering if there is a way to accomplish it in Standard SQL).

2

2 Answers

0
votes

In order to further contribute to the community and in addition to Mikhail's answer, I would like to provide you two other options.

First using the Python API for BigQuery. I created the following script:

import argparse
from google.cloud import bigquery


client = bigquery.Client()
dataset_ref = client.dataset("dataset_name", project="project_id")
dataset = client.get_dataset(dataset_ref)

parser = argparse.ArgumentParser()
parser.add_argument('-t', type=int,
                    help='number of tables that we need to be queried')

parsed = parser.parse_args() 
#retrieve the integer you passed through the command line
arg = parsed.t

def check_tables(arg):
    #counting the number of tables give the desired wildcard
    query = """SELECT COUNT(DISTINCT _TABLE_SUFFIX)
                    FROM `project.dataset.table_*`
                    """
    query_job = client.query(query) 
    result= query_job.result()
    for r in result:    
        count=r[0]
        print("There are {} tables".format(count))
    if count==int(arg):
        return True
    else:
        return False


if check_tables(arg): 
    print("In total BigQuery is querying {} tables".format(arg))
    #your new SQL query here, which will execute only if there are the amount of tables you specified
    #you can output the results to a new table

As shown above, the script used argparse library from Python. Therefore, in order to run the script, you have to use python script_name.py <int64>, in my case I ran python query.py 3. In other words, you have to specify the number of tables you are expecting to query using wildcards. The script checks how many tables there are with the given wildcard. Then, it compares with the number of tables we expect and if the numbers are equal you can execute another query (after the last if). Otherwise, output a message.

The second option is using the BigQuery Console. We will be taking advantage of Scripting. The script will basically check the number of tables within the desired wildcard, evaluate if there are the expected number of tables. Then, if there are, execute the query. The syntax is below:

DECLARE num_tables INT64 DEFAULT 3;
DECLARE count_tables INT64;
SET count_tables = (SELECT COUNT(DISTINCT _TABLE_SUFFIX) FROM `test-proj-261014.sample.abril_airports_*`);

IF(num_tables = count_tables) THEN 
SELECT *  FROM `project.dataset.table_*`;
END IF;
0
votes

You can use Jobs.Get API to see referencedTables attributes which is array of all tables touched by query

Or you can use INFORMATION_SCHEMA.TABLES to just simply count number of tables matching your pattern. Something like

SELECT COUNT(1) 
FROM yourDataset.INFORMATION_SCHEMA.TABLES 
WHERE TABLE_NAME BETWEEN 'table_name_20200520' AND 'table_name_20200525'