1
votes

I'm looking at the following DataFrame schema (names changed for privacy) in pyspark.

|-- some_data: struct (nullable = true)
|    |-- some_array: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- some_nested_array: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- some_param_1: long (nullable = true)
|    |    |    |    |    |-- some_param_2: string (nullable = true)
|    |    |    |    |    |-- some_param_3: string (nullable = true)
|    |    |    |-- some_param_4: string (nullable = true)
|    |    |    |-- some_param_5: string (nullable = true)
|    |-- some_other_array: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- some_param_6: string (nullable = true)
|    |    |    |-- some_param_7: string (nullable = true)
|    |-- yet_another_array: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- some_param_8: string (nullable = true)
|    |    |    |-- some_param_9: string (nullable = true)

I'm struggling using the explode function on the doubly nested array. I would like ideally to somehow gain access to the paramaters underneath some_array in their own columns so I can compare across some_param_1 through 9 - or even just some_param_1 through 5.

2
can you show your code & what errors you are getting ? - Srinivas

2 Answers

0
votes

Please convert the column into json and use json_path to fetch each param as column. The sample code is as follows-

df.selectExpr("get_json_object(to_json(struct(some_data)),
 '$.some_data.some_array[0].some_nested_array[0].some_param_1') as some_param_1",
 ...<add_others>).show(false)

// compare each param as column here

If you are not familiar with json path, then-

  1. get json created via -
df.selectExpr(to_json(struct(some_data))).show(false)
  1. Copy the json from the result cell to the left quadrant of https://jsonpathfinder.com/ and here you will see the object tree hierarchy. Now, click on the some_param_1 node
  2. copy the Path on the same page and replace x with $
  3. put it as 2nd param for get_json_object and you are done
  4. once get the individual param columns you can do the processing.
0
votes

You can explode the nested arrays in two steps: first explode the outer array and then the nested inner array:

from pyspark.sql.functions import explode

df2 = df.withColumn("exploded_some_array", \
                    explode("some_data.some_array"))
df2 = df2.withColumn("exploded_some_nested_array", \
                     explode("exploded_some_array.some_nested_array"))
df2.select("exploded_some_nested_array.*", \
           "exploded_some_array.some_param_4", \
           "exploded_some_array.some_param_5").show(truncate=False)

prints the 5 columns some_param_1 to some_param_5.

Adding the columns of the remaining two arrays is straight forward:

df2 = df2.withColumn("exploded_some_other_array", explode("some_data.some_other_array"))
df2 = df2.withColumn("exploded_yet_another_array", explode("some_data.yet_another_array"))
df2.select("exploded_some_nested_array.*", \
           "exploded_some_array.some_param_4", \
           "exploded_some_array.some_param_5", \
           "exploded_some_other_array.*", \
           "exploded_yet_another_array.*").show(truncate=False)

prints all columns some_param_1 to some_param_9 as a flattened table.