I was trying to create an external table on partitioned data in GCS, written from a Spark job and date-partitioned in PARQUET format.
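
For reference, a minimal sketch of how this kind of data set gets written from PySpark (the source path and the DataFrame contents here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trx_export").getOrCreate()

# hypothetical source: a DataFrame with a DATE column named transaction_date
df = spark.read.parquet("gs://<source_bucket>/raw_transactions")

# partitionBy writes one transaction_date=YYYY-MM-DD/ directory per day
(df.write
   .mode("overwrite")
   .partitionBy("transaction_date")
   .parquet("gs://transaction_data_bucket_for_bigquery/trx_data"))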

The data in the GCS bucket is as shown in the picture.

[Image: listing of the GCS bucket contents under trx_data/, showing a transaction_date=2016-01-01/ partition directory and a _SUCCESS file]

I created an external table with this table definition:

  "hivePartitioningOptions": {
    "mode": "AUTO",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}

with the command

bq mk --external_table_definition=/tmp/table_def <project>:<dataset>.sample_trx_external
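
As an aside, the same definition file can, if I read the docs right, also be generated with bq mkdef instead of being written by hand:

bq mkdef --source_format=PARQUET \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://transaction_data_bucket_for_bigquery/trx_data \
  "gs://transaction_data_bucket_for_bigquery/trx_data/*" > /tmp/table_def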

When I try to query the table, I am getting a strange error:

Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01.; Cannot add hive partitioning to table <data_set>.sample_trx_external -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01..

Unfortunately I wasn't able to decipher the message. There is only one day of transactions written to the GCS bucket.

When I tried with a CUSTOM mode definition

{
  "hivePartitioningOptions": {
    "mode": "CUSTOM",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data/{transaction_date:DATE}"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}

I got a slightly different error:

Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data.; Cannot add hive partitioning to table <data_Set>.sample_trx_external_2 -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data..

I got stuck here, and any suggestion would be a great help.

In which locations are your BigQuery dataset and GCS bucket placed? Have you checked the limitations? – aga
Yep, I checked the limitations. The data format is in the supported Hive partition structure. Regarding the data location: initially, when the data was kept in an EU regional bucket, it failed with an error saying the data could not be accessed, but when I later created a multi-regional EU bucket, I didn't see that error. The BigQuery dataset is also in the same region. – Srinivas Jill
Solved the issue: _SUCCESS is the culprit. The Parquet file generation from Spark writes this additional file, and it is also being picked up by the external table definition. – Srinivas Jill

1 Answer


As shown in the image attached to the question, there is a _SUCCESS file, written by the Spark job that created the partitioned data set. The problem is that, under the "gs://transaction_data_bucket_for_bigquery/trx_data/" path, the BigQuery external table expects every directory and file to follow the partition layout. _SUCCESS clearly doesn't follow this structure, which produced the error message above.
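
If you would rather stop Spark from writing the marker in the first place, the Hadoop output committer can be told to skip it. A minimal PySpark sketch (assuming the job uses the default FileOutputCommitter; the spark.hadoop. prefix forwards the setting into the Hadoop configuration):

from pyspark.sql import SparkSession

# suppress the _SUCCESS marker file on job completion
spark = (SparkSession.builder
         .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs",
                 "false")
         .getOrCreate())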

Fix: I just deleted the file, and everything works as expected.
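
For the record, the delete is a one-liner with gsutil (assuming the marker sits directly under the prefix, as in the screenshot):

gsutil rm gs://transaction_data_bucket_for_bigquery/trx_data/_SUCCESS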