10
votes

Are there any tools that will validate a JSON string against a BigQuery schema? I'd like to load valid ones to BQ, and re-process invalid ones.

I know that you can validate against a standard JSON Schema using (e.g.) Python's jsonschema; is there something similar for BQ schemas?


Re Pentium10's comment: I can imagine a number of ETL scenarios where data from several sources has to be assembled so that it matches a BQ schema. Currently I need two schemas for the data, a JSON Schema and a BQ schema; I validate against the JSON Schema and hope that this is enough to satisfy the BQ schema on submission.


Specifically: in this situation, I have JSON which has arrived from a JavaScript front end and been entered into BQ as a string. I want to process this field and add it to BQ as a table in its own right, so that I can search it.

The JSON (more or less) falls into two 'schemas', but it is poorly typed (i.e. numbers are treated as strings, lists of length 1 are strings, not lists...). I want a quick way to see whether a field would go into the table, and it seemed a little silly that I have a BQ table schema but cannot validate against it; instead, I must also create a JSON Schema for the idealised data and check against that.
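For example, a poorly typed record might look like this (the field names are hypothetical):

{"user_id": "12345", "score": "9.5", "tags": "new"}

where user_id and score are really numbers, and tags is really a one-element list.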

4
What would be a real use case scenario for such a feature? – Pentium10
I have JSON which has arrived from a JavaScript front end and been entered into BQ as a string. – Jonathan Miller

4 Answers

0
votes

If you re-express your schema in JSON Schema (see http://json-schema.org/implementations.html), then you should be able to use one of the tools listed there to do your validation.
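As a minimal sketch of that approach (the table layout and field names here are hypothetical), a JSON Schema mirroring a simple BQ table could be checked with Python's jsonschema:

import jsonschema

# Hand-written JSON Schema mirroring a hypothetical BQ table with a
# REQUIRED STRING column "name" and a NULLABLE INTEGER column "age".
record_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": ["integer", "null"]},
    },
    "required": ["name"],
    "additionalProperties": False,
}

# Raises jsonschema.exceptions.ValidationError if the record does not conform.
jsonschema.validate({"name": "widget", "age": 3}, record_schema)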

0
votes

I would suggest loading your JSON schema as a JSON object in Python; with that, you could try to validate it against the schema retrieved with BigQuery's client library.

1 - Request the schema of a BigQuery table (this could then be done dynamically):

from google.cloud import bigquery

# Build a client and a reference to the table whose schema we want.
client = bigquery.Client(project='your_project')
dataset_ref = client.dataset('your_dataset')
table_ref = dataset_ref.table('your_table_name')

# Fetch the table metadata, including its schema.
table_helper = client.get_table(table_ref)

2 - Get the schema and format it as JSON; after that you should be able to compare the two schemas.

What you have now is a list of SchemaField objects:

your_schema = table_helper.schema

You could format each field as a dict and then dump the list into a JSON string:

import json

formatted_list_schema = [
    {
        "name": field.name,
        "type": field.field_type,
        "mode": field.mode,
        "description": field.description,
        # Nested RECORD fields (field.fields) would need recursive handling.
    }
    for field in your_schema
]

json_bq_schema = json.dumps(formatted_list_schema)

You could then normalize that BQ JSON schema in order to compare it, as they do here: How to compare two JSON objects with the same elements in a different order equal?
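As a minimal sketch of such a comparison (assuming both schemas have been reduced to lists of field dicts, as above), you could canonicalize before comparing:

import json

def schemas_match(schema_a, schema_b):
    # Serialize each field with sorted keys, then sort the fields themselves,
    # so neither key order nor field order affects the comparison.
    def canonical(schema):
        return sorted(json.dumps(field, sort_keys=True) for field in schema)
    return canonical(schema_a) == canonical(schema_b)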

I know that this is not an easy solution to implement, but I guess if you tweak it well enough, it will be robust and can solve your problem. Feel free to ask if I can help you more.

For more info about schemas, check https://cloud.google.com/bigquery/docs/schemas

0
votes

It's hard to answer without any examples provided, but generally you can use jsonschema for that.

Here's a metaschema definition in YAML:

"$schema": http://json-schema.org/draft-07/schema

title: Metaschema for BigQuery fields definition schemas
description: "See also: https://cloud.google.com/bigquery/docs/schemas"

type: array
minItems: 1
uniqueItems: yes

items:
  "$id": "#/items"
  title: Single field definition schema
  type: object

  examples:

  - name: Item_Name
    type: STRING
    mode: NULLABLE
    description: Name of catalog item

  - name: Item_Category
    type: STRING
    mode: REQUIRED

  - name: Exchange_Rate
    type: NUMERIC

  additionalProperties: no
  required:
  - name
  - type

  properties:

    name:
      "$id": "#/items/properties/name"
      title: Name of field
      description: "See also: https://cloud.google.com/bigquery/docs/schemas#column_names"
      type: string
      minLength: 1
      maxLength: 128
      pattern: "^[a-zA-Z_]+[a-zA-Z0-9_]*$"
      examples:
      - Item_Name
      - Exchange_Rate

    description:
      "$id": "#/items/properties/description"
      title: Description of field
      description: "See also: https://cloud.google.com/bigquery/docs/schemas#column_descriptions"          
      type: string
      maxLength: 1024

    type:
      "$id": "#/items/properties/type"
      title: Name of BigQuery data type
      description: 'See also: https://cloud.google.com/bigquery/docs/schemas#standard_sql_data_types'
      type: string
      enum:
      - INTEGER
      - FLOAT
      - NUMERIC
      - BOOL
      - STRING
      - BYTES
      - DATE
      - DATETIME
      - TIME
      - TIMESTAMP
      - GEOGRAPHY

    mode:
      "$id": "#/items/properties/mode"
      title: Mode of field
      description: 'See also: https://cloud.google.com/bigquery/docs/schemas#modes'
      type: string
      default: NULLABLE
      enum:
      - NULLABLE
      - REQUIRED
      - REPEATED

This is the most precise metaschema I've been able to produce from the GCP docs. Structures (RECORD fields) and arrays are not supported here, though.

YAML here is just for readability; you can easily convert it into JSON if needed.
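For instance, a one-off conversion could be done like this (the file paths are just placeholders):

import json
from pathlib import Path

import yaml

# Load the YAML metaschema and write it back out as JSON.
metaschema = yaml.safe_load(Path("/path/to/metaschema.yaml").read_text())
Path("/path/to/metaschema.json").write_text(json.dumps(metaschema, indent=2))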

Assuming the metaschema above is saved as "/path/to/metaschema.yaml", usage is as follows:

import json
from pathlib import Path

import jsonschema
import yaml

# Load the metaschema (YAML) and the BQ schema under test (JSON).
metaschema = yaml.safe_load(Path("/path/to/metaschema.yaml").read_text())

schema = """[{"name": "foo", "type": "STRING"}]"""
schema = json.loads(schema)

# Passes silently if the schema is valid, raises ValidationError otherwise.
jsonschema.validate(schema, metaschema)

The yaml module used above is provided by the PyYAML package.

If the schema is valid, the jsonschema.validate() function will simply pass. Otherwise, jsonschema.exceptions.ValidationError will be raised with an explanation of the error.
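So, to load valid documents and re-process invalid ones (as the question asks), a sketch continuing the snippet above could look like this:

try:
    jsonschema.validate(schema, metaschema)
except jsonschema.exceptions.ValidationError as exc:
    # Route the invalid schema to re-processing; exc.message explains the failure.
    print(f"Invalid: {exc.message}")
else:
    print("Valid: ready to load into BigQuery")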

It's up to you whether to use JSON or YAML and how to store and parse schemas.

It's also up to you whether to convert the names of types and modes to upper- or lowercase.
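A minimal normalization sketch (assuming the uppercase enums from the metaschema above, and continuing the snippet above):

def normalize(fields):
    # Uppercase "type" and "mode" values so they match the metaschema enums.
    for field in fields:
        for key in ("type", "mode"):
            if key in field:
                field[key] = field[key].upper()
    return fields

jsonschema.validate(normalize(schema), metaschema)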

0
votes

This is one of the implementations that I created. https://github.com/toshi0607/bq-schema-validator

It's a bit fuzzy, but it usually detects the error-prone fields in a JSON log.