0
votes

I am porting a java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes avro files to hdfs and then creates Hive external tables with one/multiple partitions on top of the files. I understand Big Query only supports date/timestamp partitions for now, and no nested partitions.

The way we now handle hive is that we generate the ddl and then execute it with a rest call.

I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the java library. I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call. Here's a snippet of the code I use:

....
ExternalTableDefinition extTableDef = 
ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);   
....

There is however support for partitions for non external tables.

I have a few questions questions:

  • is there support for creating external tables with partition(s)? Can you please point me in the right direction
  • is loading the data into BigQuery preferred to having it stored in GS avro files?
  • if yes, how would we deal with schema evolution?

thank you very much in advance

1

1 Answers

2
votes
  1. You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
  2. If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
  3. In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.

See also a relevant blog post about lazy data loading.