2 votes

I exported a df from PySpark to BigQuery. The df has columns with array elements; how can I convert the arrays into concatenated strings? Whenever I try to query an array column of the exported BigQuery table, I get the error below.

Error: Cannot access field element on a value with type ARRAY<STRUCT<element STRING>> 
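
For reference, a query along these lines reproduces the error (the table name is a placeholder, not my actual query):

SELECT antecedent.element
FROM `myproject.mydataset.rules`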

Below is the df in PySpark that was exported to BigQuery:

+-----------+----------+-------------------+------------------+
| antecedent|consequent|         confidence|              lift|
+-----------+----------+-------------------+------------------+
|[19, 6, 39]|      [44]|  0.418945963975984|10.602038775664845|
|[19, 6, 39]|      [35]|0.47298198799199465| 7.423073393311932|
|[19, 6, 39]|     [322]|   0.47631754503002|4.6520592549063995|
|[19, 6, 39]|     [308]| 0.4496330887258172| 5.856259667533207|
|[19, 6, 39]|     [345]| 0.5196797865243495|5.4970336458402285|
+-----------+----------+-------------------+------------------+

Below is the schema of the BigQuery table that was generated after exporting the df from PySpark:

(screenshot of the generated schema: the array columns appear as ARRAY<STRUCT<element STRING>>)

When viewed as a table, this is what it looks like:

(screenshot of the table preview in the BigQuery console)

Provide the SQL statement of your query, so we can fix it for you - Mikhail Berlyant

1 Answer

0 votes

This is a known issue when using parquet as the intermediate format to write to BigQuery. Can you please switch to ORC? The way to do it is:

df.write.format("bigquery") \
    .option("table", "...") \
    .option("temporaryGcsBucket", "...") \
    .option("intermediateFormat", "orc") \
    .save()
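
With ORC as the intermediate format, the array columns should land in BigQuery as plain ARRAY<STRING>, so they can be flattened with ARRAY_TO_STRING(antecedent, ',').

If re-exporting is not an option, the existing table can still be queried by unnesting the wrapper struct. A minimal sketch, assuming a hypothetical table mydataset.rules:

SELECT
  (SELECT STRING_AGG(a.element, ',') FROM UNNEST(antecedent) AS a) AS antecedent,
  (SELECT STRING_AGG(c.element, ',') FROM UNNEST(consequent) AS c) AS consequent,
  confidence,
  lift
FROM `mydataset.rules`

Alternatively, the arrays can be concatenated into strings on the Spark side before writing, for example with concat_ws (a sketch using the column names from the question):

from pyspark.sql import functions as F

# join each array column into a single comma-separated STRING before export
df_flat = (df
    .withColumn("antecedent", F.concat_ws(",", "antecedent"))
    .withColumn("consequent", F.concat_ws(",", "consequent")))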