1
votes

I'm running a Spark job on Google Cloud Dataproc and using the BigQuery Connector to load the JSON output of the job into a BigQuery table.

The BigQuery standard-SQL data types documentation states that the ARRAY type is supported.

My Scala code is:

val outputDatasetId = "mydataset"
val tableSchema = "["+
    "{'name': '_id', 'type': 'STRING'},"+
    "{'name': 'array1', 'type': 'ARRAY'},"+
    "{'name': 'array2', 'type': 'ARRAY'},"+
    "{'name': 'number1', 'type': 'FLOAT'}"+
    "]"

// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
    conf, projectId, outputDatasetId, "outputTable", 
    tableSchema)

// Write the JSON records to BigQuery
jsonData.saveAsNewAPIHadoopDataset(conf)

But the job throws this exception:

{
  "code" : 400,
  "errors" : [ {
  "domain" : "global",
  "message" : "Invalid value for: ARRAY is not a valid value",
  "reason" : "invalid"
   } ],
  "message" : "Invalid value for: ARRAY is not a valid value"
}
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:432)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:287)
    at com.google.cloud.hadoop.io.bigquery.BigQueryRecordWriter.close(BigQueryRecordWriter.java:358)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$5.apply$mcV$sp(PairRDDFunctions.scala:1124)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1366)
    ... 8 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request

Is this a Legacy vs. Standard SQL issue? Or is ARRAY type not supported by the BigQuery Connector for Spark?


2 Answers

7
votes

Instead of using type=ARRAY, set the type to the element type as you normally would, but also set the key mode=REPEATED.

An array of strings, for instance, would be defined as:

{'name': 'field1', 'type': 'STRING', 'mode': 'REPEATED'}
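Applied to the schema from the question, the full definition would look like this (a sketch: the STRING element type for the two array fields is an assumption, since the question doesn't say what they contain):

```scala
// Array fields keep their element type and add mode=REPEATED.
// (STRING element type is assumed here; adjust to the actual contents.)
val tableSchema = "["+
    "{'name': '_id', 'type': 'STRING'},"+
    "{'name': 'array1', 'type': 'STRING', 'mode': 'REPEATED'},"+
    "{'name': 'array2', 'type': 'STRING', 'mode': 'REPEATED'},"+
    "{'name': 'number1', 'type': 'FLOAT'}"+
    "]"
```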
2
votes

Are these arrays of strings? Integers? I believe that using this API, you need to set the type to the element type, e.g. STRING or INT64, but use a mode of REPEATED. The BigQuery APIs have not yet been fully updated to use standard SQL types everywhere, so you need to use the legacy convention of type + mode instead.
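As a sketch of that convention, a column that would be declared ARRAY&lt;INT64&gt; in standard SQL is instead given the element type plus mode=REPEATED (the field name scores is made up for illustration):

```scala
// Standard SQL ARRAY<INT64> expressed in the legacy type+mode convention.
val arrayFieldSchema = "{'name': 'scores', 'type': 'INT64', 'mode': 'REPEATED'}"

// A matching JSON record for such a field carries a plain array value:
val sampleRow = """{"scores": [1, 2, 3]}"""
```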