3
votes

Currently I use Parquet files to import data into BigQuery (GIS). One of the Parquet files holds geometry column, representing geometry data as WKT string (MultiPolygon) and I want to import that column as GEOGRAPHY type.

Q1. Documentation mentions that automatic conversion from WKT (string) to GEOGRAPHY is not supported, so how can I overcome this issue? I want to avoid using CSV files, and providing schema definitions manually.

However, even if I create empty table first, and only want to append new Parquet file, it's also not working: Provided Schema does not match Table <org>-internal:test.<table>. Field geom has changed type from GEOGRAPHY to STRING.. (It's same with Avro)

Q2. Is there an option to somehow force casting Parquet's column holding WKT strings to GEOGRAPHY type? I don't want to load temporary table first and then run a query, that does all the casting, to load the target table.

Update: When using python client, I'm able to specify the schema manually for the Parquet file, that I want to import. After loading is completed, all values in GEOGRAPHY column has value INVALID. The same code works fine with CSV.

from google.cloud import bigquery

client = bigquery.Client(project='<project>')

table_ref = client.dataset('test').table('geometry')

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.schema = [
    bigquery.SchemaField('id', 'INTEGER'),
    bigquery.SchemaField('geom', 'GEOGRAPHY'),
]

# CSV works fine!
# uri = 'gs://<bucket>/multipoly_sample.csv'
# job_config.source_format = bigquery.SourceFormat.CSV

# With Parquet, values in geom column are all "INVALID" 
uri = 'gs://<bucket>/multipoly_sample.parquet'
job_config.source_format = bigquery.SourceFormat.PARQUET

load_job = client.load_table_from_uri(
    uri,
    table_ref,
    job_config=job_config)
load_job.result()
1

1 Answers

1
votes

This is not yet possible during GIS beta. It will be available before BigQuery GIS goes GA, but for now you cannot load directly using Parquet. During Beta you can load directly to Geography column using CSV, newline-delimited-JSON or streaming load API.

Update (Oct 30th, 2018): this should work now. You can either specify GEOGRAPHY as column type, or load into existing table with GEOGRAPHY column. The string column from Parquet/Avro will be parsed and converted to GEOGRAPHY.