2
votes

When updating a view, indirect schema changes do not seem to be picked up.

Steps to reproduce

  • Create view1 with a field field1 (e.g. SELECT 1 AS field1)
  • Create view2 selecting all fields from view1
  • Update view1 to also include field2 (e.g. SELECT 1 AS field1, 2 AS field2)
  • Update view2 with the same query as before (due to documented limitation)

Desired outcome

  • Schema for view1 and view2 including field1 and field2
  • View updates should be atomic

Actual outcome

  • Schema for view1 correctly updated (including field1 and field2)
  • Schema for view2 is only including field1
  • Selecting from view2 does actually return field1 and field2

I could delete view2 and re-create it again but that wouldn't be atomic and there would be a time where the view is not available which is undesired.

I also tried to update the schema property of view2 but that is refused with Cannot add fields (field: field2):

google.api_core.exceptions.BadRequest: 400 PATCH https://www.googleapis.com/bigquery/v2/projects/<project-id>/datasets/dataset1/tables/view2: Provided Schema does not match Table <project-id>:dataset1.view2. Cannot add fields (field: field2)

Question

Is there any way to update the view atomically while also updating the schema which was changed indirectly (a table/view the view selects from).

Note: of course my view2 would add additional fields and I could currently determine its schema by creating a new temporary view.

Note: the schema is important because tools such as the Data Studio's BigQuery connector are inspecting the schema.

Code to reproduce the steps

# Python 3.6+
import google.api_core.exceptions
from google.cloud import bigquery


def delete_table_if_exists(client: bigquery.Client, table: bigquery.Table):
    try:
        client.delete_table(table)
    except google.api_core.exceptions.NotFound:
        pass


def full_table_id(table: bigquery.Table) -> str:
    # Note: the documentation says it should be separated by a dot but uses a colon
    return table.full_table_id.replace(':', '.')


def view_test():
    client = bigquery.Client()

    dataset_ref = client.dataset('dataset1')
    try:
        client.create_dataset(dataset_ref)
    except google.api_core.exceptions.Conflict:
        pass

    view1 = bigquery.Table(dataset_ref.table('view1'))
    view2 = bigquery.Table(dataset_ref.table('view2'))
    delete_table_if_exists(client, view1)
    delete_table_if_exists(client, view2)

    view1.view_query = 'SELECT 1 AS field1'
    view1 = client.create_table(view1)

    view2.view_query = f'SELECT * FROM `{full_table_id(view1)}`'
    client.create_table(view2)

    view1.view_query = 'SELECT 1 AS field1, 2 AS field2'
    client.update_table(view1, ['view_query'])

    client.update_table(view2, ['view_query'])
    print('view2 schema:', client.get_table(view2).schema)

    # trying to update the schema fails with 'Cannot add fields (field: field2)'
    view2.schema = client.get_table(view1).schema
    client.update_table(view2, ['schema'])


if __name__ == '__main__':
    view_test()

Bash example doing the same

#!/bin/bash

set -e

project_id=$(gcloud config list --format 'value(core.project)' 2>/dev/null)

bq mk -f dataset1

bq rm -f dataset1.view1
bq rm -f dataset1.view2

bq mk --use_legacy_sql=false --view 'SELECT 1 AS field1' dataset1.view1
bq mk --use_legacy_sql=false --view 'SELECT * FROM `'$project_id'.dataset1.view1`' dataset1.view2

bq update --use_legacy_sql=false --view 'SELECT 1 AS field1, 2 AS field2' dataset1.view1
bq update --use_legacy_sql=false --view 'SELECT * FROM `'$project_id'.dataset1.view1`' dataset1.view2

bq show dataset1.view2

Update: Code with accepted answer

Python code

def get_create_or_replace_view_query(view: bigquery.Table) -> str:
    return f'CREATE OR REPLACE VIEW {view.dataset_id}.{view.table_id} AS {view.view_query}'


def view_test():
    # ...
    query_job = client.query(get_create_or_replace_view_query(view2))
    query_job.result()
    print('view2 schema:', client.get_table(view2).schema)

Bash magic

bq query --use_legacy_sql=false 'CREATE OR REPLACE VIEW dataset1.view2 AS SELECT * FROM `'$project_id'.dataset1.view1`'
1
I don't understand a thing, if you say that view2 selects all fields from view1 the schema is actually a * star operator, so if you modify view 1 the fields are available in view 2 as well. The doc is clear about this limtation: The schemas of the underlying tables are stored with the view when the view is created. If columns are added, deleted, and so on after the view is created, the reported schema will be inaccurate until the view is updated. Even though the reported schema may be inaccurate, all submitted queries produce accurate results cloud.google.com/bigquery/docs/views-introPentium10
@Pentium10 I understand that view2's schema is not updated automatically. I am 'happy' to update it manually but I am not sure how to do that atomically.de1
@Pentium10 I added the step that was previously only mentioned further down that I tried to update view2 as well.de1
You need to use the PATCH method and not the update method see with that. On the other hand there is an alpha listed for information_schema, join that as well: issuetracker.google.com/issues/35906063Pentium10
@Pentium10 do you have more information on that please? The update_table method seems to be using the PATCH HTTP method. I've added a more detailed exception which shows that (when I try to update the schema).de1

1 Answers

4
votes

You should use a CREATE OR REPLACE VIEW statement; see the related documentation. BigQuery provides ACID semantics for all queries that perform table modifications, and CREATE OR REPLACE VIEW is no exception, so this replaces the definition and schema of the view atomically.