1 vote

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.

To do this, I have code that uses the BigQuery API to find the most recent dataset and the names of all the tables it contains. Then, for each of these tables, I fire a tables.insert call to create a view defined as a SELECT * from the table I want to reference.
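
For context, a minimal sketch of that loop might look like the following. This is not my exact code; bq_service (an authenticated google-api-python-client BigQuery v2 service), BQ_PROJECT_ID, and the dataset-picking logic are placeholders standing in for what I actually run:

import re

def create_latest_views(bq_service, BQ_PROJECT_ID):
    # Find the most recent YYYY_MM_DD dataset by sorting the dated dataset ids.
    datasets = bq_service.datasets().list(projectId=BQ_PROJECT_ID).execute()
    dated = sorted(
        d['datasetReference']['datasetId']
        for d in datasets.get('datasets', [])
        if re.match(r'\d{4}_\d{2}_\d{2}$', d['datasetReference']['datasetId'])
    )
    latest = dated[-1]

    # Create one SELECT * view per table in that dataset.
    tables = bq_service.tables().list(
        projectId=BQ_PROJECT_ID, datasetId=latest).execute()
    for t in tables.get('tables', []):
        table_id = t['tableReference']['tableId']
        bq_service.tables().insert(
            projectId=BQ_PROJECT_ID,
            datasetId='Latest_Production_Data',
            body={
                'tableReference': {
                    'projectId': BQ_PROJECT_ID,
                    'datasetId': 'Latest_Production_Data',
                    'tableId': table_id,
                },
                'view': {
                    'query': 'SELECT * FROM [%s.%s]' % (latest, table_id),
                },
            }).execute()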

This fails for tables that contain a RECORD field, due to what looks like a fairly benign column-naming rule.

For example, I have this table:

AccountDeletionRequest schema

For which I issue this API call:

{
  'tableReference': {
    'projectId': 'redacted',
    'tableId': u'AccountDeletionRequest',
    'datasetId': 'Latest_Production_Data'
  },
  'view': {
    'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
  },
}

This results in the following error:

HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">

When I execute this query in the BigQuery web console, the columns are automatically renamed, translating the . into an _. I expected the same thing to happen when I issued the create-view API call.

SELECT * results

Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is with RECORD columns, but another problem I anticipate is tables that have REPEATED fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?

Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
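
If I do fall back to copying, that would just be a jobs.insert call with a copy configuration, something like this sketch (bq_service and the ids are placeholders, not my actual code):

def copy_table(bq_service, project_id, source_dataset, table_id):
    # Copy-job alternative: duplicates the data, unlike a view.
    job = {
        'configuration': {
            'copy': {
                'sourceTable': {
                    'projectId': project_id,
                    'datasetId': source_dataset,
                    'tableId': table_id,
                },
                'destinationTable': {
                    'projectId': project_id,
                    'datasetId': 'Latest_Production_Data',
                    'tableId': table_id,
                },
                # Overwrite last week's copy each time.
                'writeDisposition': 'WRITE_TRUNCATE',
            }
        }
    }
    return bq_service.jobs().insert(projectId=project_id, body=job).execute()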

I remember the BQ devs saying something about views having some bugs with SELECT *. Can you try this once with a SELECT statement that mentions all the field names explicitly? – Rohit
I imagine that would work, but it'd require reading the schema of each table to generate the SQL defining each view. Instead, I'd probably just fall back to the table copy approach. – Matt Faus

2 Answers

1 vote

Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:

def get_leaf_column_selectors(dataset, table):
    # table_service is the BigQuery API tables() collection, configured elsewhere.
    schema = table_service.get(
        projectId=BQ_PROJECT_ID,
        datasetId=dataset,
        tableId=table
    ).execute()['schema']

    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])


def _get_leaf_selectors(prefix, field):
    # Build "parent.child as parent_child" selectors for every leaf column.
    if prefix:
        format = prefix + ".%s"
    else:
        format = "%s"

    if 'fields' not in field:
        # Base case: a leaf column. Alias dotted names to legal view column names.
        actual_name = format % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case: a RECORD field; descend into its sub-fields.
        return ",\n".join([
            _get_leaf_selectors(format % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
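
To give a sense of how this plugs in, here is roughly how the generated selector list feeds the view definition. This is a simplified sketch rather than my exact code; create_view is just an illustrative wrapper around the same table_service used above:

def create_view(dataset, table):
    # Build the explicit field list, then define the view with it instead of SELECT *.
    selectors = get_leaf_column_selectors(dataset, table)
    table_service.insert(
        projectId=BQ_PROJECT_ID,
        datasetId='Latest_Production_Data',
        body={
            'tableReference': {
                'projectId': BQ_PROJECT_ID,
                'datasetId': 'Latest_Production_Data',
                'tableId': table,
            },
            'view': {
                'query': 'SELECT %s FROM [%s.%s]' % (selectors, dataset, table),
            },
        }).execute()
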
1 vote

We had a bug where you needed to select out the individual fields in the view and use an 'as' to rename the fields to something legal (i.e. names that don't have a '.' in them).

The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.