Explode nested arrays in pyspark

Question

I'm looking at the following DataFrame schema (names changed for privacy) in pyspark.

|-- some_data: struct (nullable = true)
|    |-- some_array: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- some_nested_array: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- some_param_1: long (nullable = true)
|    |    |    |    |    |-- some_param_2: string (nullable = true)
|    |    |    |    |    |-- some_param_3: string (nullable = true)
|    |    |    |-- some_param_4: string (nullable = true)
|    |    |    |-- some_param_5: string (nullable = true)
|    |-- some_other_array: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- some_param_6: string (nullable = true)
|    |    |    |-- some_param_7: string (nullable = true)
|    |-- yet_another_array: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- some_param_8: string (nullable = true)
|    |    |    |-- some_param_9: string (nullable = true)

I'm struggling to use the explode function on the doubly nested array. Ideally, I would like to get the parameters underneath some_array into their own columns so I can compare across some_param_1 through 9, or even just some_param_1 through 5.

Som Som · Accepted Answer · 2020-05-19T05:48:56

Convert the column to JSON and use a JSON path with get_json_object to fetch each param as a column. Sample code:

df.selectExpr(
    "get_json_object(to_json(struct(some_data)), "
    "'$.some_data.some_array[0].some_nested_array[0].some_param_1') as some_param_1",
    # ...<add_others>
).show(truncate=False)

# compare each param as column here

If you are not familiar with JSON path:

Get the JSON via:

df.selectExpr("to_json(struct(some_data))").show(truncate=False)

Copy the JSON from the result cell into the left pane of https://jsonpathfinder.com/, where you will see the object tree hierarchy. Now click on the some_param_1 node.
Copy the Path shown on the same page and replace x with $.
Pass it as the 2nd argument to get_json_object and you are done.
Once you have the individual param columns, you can do the processing.
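The steps above can be sketched as a small helper that turns copied paths into selectExpr expressions. This is only an illustration: the function names to_json_path and param_expr are hypothetical, the second path is an assumed example based on the schema in the question, and the alias is simply taken from the last path segment.

```python
def to_json_path(finder_path: str) -> str:
    """Replace the leading 'x' of a copied path with '$'."""
    if not finder_path.startswith("x"):
        raise ValueError(f"expected a path starting with 'x', got {finder_path!r}")
    return "$" + finder_path[1:]


def param_expr(finder_path: str, column: str = "some_data") -> str:
    """Build a get_json_object expression string for selectExpr."""
    json_path = to_json_path(finder_path)
    # Use the last path segment as the output column name
    # (assumes the path ends in a field name, not an array index).
    alias = json_path.rsplit(".", 1)[-1]
    return f"get_json_object(to_json(struct({column})), '{json_path}') as {alias}"


# Paths as copied from the path finder page (illustrative values).
paths = [
    "x.some_data.some_array[0].some_nested_array[0].some_param_1",
    "x.some_data.some_array[0].some_param_4",
]
exprs = [param_expr(p) for p in paths]

for e in exprs:
    print(e)
```

With a DataFrame df in scope, the expressions can then be passed as df.selectExpr(*exprs).show(truncate=False).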
