1
votes

I'm running a Spark job on Google Cloud Dataproc and using the BigQuery Connector to load the JSON output of the job into a BigQuery table.

The BigQuery standard-SQL data types documentation states that the ARRAY type is supported.

My Scala code is:

val outputDatasetId = "mydataset"
val tableSchema = "["+
    "{'name': '_id', 'type': 'STRING'},"+
    "{'name': 'array1', 'type': 'ARRAY'},"+
    "{'name': 'array2', 'type': 'ARRAY'},"+
    "{'name': 'number1', 'type': 'FLOAT'}"+
    "]"

// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
    conf, projectId, outputDatasetId, "outputTable", 
    tableSchema)

// Write the JSON records to BigQuery
jsonData.saveAsNewAPIHadoopDataset(conf)

But the job throws this exception:

{
  "code" : 400,
  "errors" : [ {
  "domain" : "global",
  "message" : "Invalid value for: ARRAY is not a valid value",
  "reason" : "invalid"
   } ],
  "message" : "Invalid value for: ARRAY is not a valid value"
}
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:432)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:287)
    at com.google.cloud.hadoop.io.bigquery.BigQueryRecordWriter.close(BigQueryRecordWriter.java:358)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$5.apply$mcV$sp(PairRDDFunctions.scala:1124)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1366)
    ... 8 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request

Is this a Legacy vs. Standard SQL issue? Or is ARRAY type not supported by the BigQuery Connector for Spark?


2 Answers

7
votes

Instead of using type=ARRAY, set the type to the element type as you normally would, but also set the key mode=REPEATED.

An array of strings, for instance, would be defined as:

{'name': 'field1', 'type': 'STRING', 'mode': 'REPEATED'}
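Applied to the schema from the question, the full definition would look like this (a sketch: the STRING element type for the two array fields is an assumption, since the question doesn't say what they contain):

```scala
// Array fields keep their element type and add mode=REPEATED.
// (STRING element type is assumed here; adjust to the actual contents.)
val tableSchema = "["+
    "{'name': '_id', 'type': 'STRING'},"+
    "{'name': 'array1', 'type': 'STRING', 'mode': 'REPEATED'},"+
    "{'name': 'array2', 'type': 'STRING', 'mode': 'REPEATED'},"+
    "{'name': 'number1', 'type': 'FLOAT'}"+
    "]"
```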
2
votes

Are these arrays of strings? Integers? I believe that using this API, you need to set the type to the element type, e.g. STRING or INT64, but use a mode of REPEATED. The BigQuery APIs have not yet been fully updated to use standard SQL types everywhere, so you need to use the legacy convention of type + mode instead.
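As a sketch of that convention, a column that would be declared ARRAY&lt;INT64&gt; in standard SQL is instead given the element type plus mode=REPEATED (the field name scores is made up for illustration):

```scala
// Standard SQL ARRAY<INT64> expressed in the legacy type+mode convention.
val arrayFieldSchema = "{'name': 'scores', 'type': 'INT64', 'mode': 'REPEATED'}"

// A matching JSON record for such a field carries a plain array value:
val sampleRow = """{"scores": [1, 2, 3]}"""
```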