
I am trying to iterate over an array of arrays stored as a column in a Spark DataFrame. Looking for the best way to do this.

Schema:

root
 |-- Animal: struct (nullable = true)
 |    |-- Species: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- mammal: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- description: string (nullable = true)

Currently I am using this logic, but it only gets the first element of the outer array:

df.select(
   col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
)

Pseudo Logic:

col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(1).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(2).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(...).getItem("mammal").getItem("description")

Desired Example Output (flattened elements as string)

llama, sheep, rabbit, hare

1 Answer


You can apply explode twice: first on Animal.Species, and then on the result of the first explode:

import org.apache.spark.sql.functions._
df.withColumn("tmp", explode(col("Animal.Species")))
  .withColumn("tmp", explode(col("tmp.mammal")))
  .select("tmp.description")
  .show()
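
The explode approach yields one row per description. If you want the single comma-separated string from the desired output instead, a sketch using flatten and array_join (both available in Spark 2.4+) could look like this; the column name descriptions is just an illustrative choice:

import org.apache.spark.sql.functions._

df.select(
    array_join(
      // Animal.Species.mammal extracts the nested arrays of structs;
      // flatten merges them into one array, and getField pulls out
      // the description strings.
      flatten(col("Animal.Species.mammal")).getField("description"),
      ", "
    ).as("descriptions")
  )
  .show(false)

For the example schema this should produce a single string per row, e.g. llama, sheep, rabbit, hare.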