flattening array of struct in pyspark

Question

I have an XML file converted to dataframe using spark-xml package. The dataframe has the following structure:

root
 |-- results: struct (nullable = true)
 |    |-- result: struct (nullable = true)
 |    |    |-- categories: struct (nullable = true)
 |    |    |    |-- category: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- value: string (nullable = true)

if I select the category column(which may appear multiple times under categories):

df.select((col('results.result.categories.category')).alias("result_categories"))

For one record, The result will look like

[[result1], [result2]]

I'm trying to flatten the results:

[result1, result2]

When I use the flatten function, I'm getting an error message:

df.select(flatten(col('results.result.categories.category')).alias("Hits_Category"))
 cannot resolve 'flatten(`results`.`result`.`categories`.`category`)' due to data type mismatch: The argument should be an array of arrays, but '`results`.`result`.`categories`.`category`' is of array<struct<value:string>

I end up creating a udf, and pass the column to the udf which spits out a flattened string version of the column.

Is there a better way?

blackbishop blackbishop · Accepted Answer · 2020-02-24T17:53:56

You're trying to apply flatten function for an array of structs while it expects an array of arrays:

flatten(arrayOfArrays) - Transforms an array of arrays into a single array.

You don't need UDF, you can simply transform the array elements from struct to array then use flatten.

Something like this:

df.select(col('results.result.categories.category').alias("result_categories"))\
  .withColumn("result_categories", expr("transform(result_categories, x -> array(x.*))"))\
  .select(flatten(col("result_categories")).alias("Hits_Category"))\
  .show()

flattening array of struct in pyspark

2 Answers