1 vote

I have a big PySpark DataFrame (23M rows) in the following format:

names, sentiment
["Lily","Kerry","Mona"], 10
["Kerry", "Mona"], 2
["Mona"], 0

I would like to compute the average sentiment for each unique name in the names column, resulting in:

name, sentiment
"Lily", 10
"Kerry", 6
"Mona", 4
Have you tried a head-on approach of exploding the array, grouping by name and taking the average? Is the dataframe too big for that? – mazaneicha
I did not, sorry; that actually worked well. – Ferran

2 Answers

1 vote
import org.apache.spark.sql.functions.{avg, col, explode}
import spark.implicits._  // assumes a SparkSession named `spark` is in scope (e.g. spark-shell)

// Sample data matching the question
val avgDF = Seq(
  (Seq("Lily", "Kerry", "Mona"), 10),
  (Seq("Kerry", "Mona"), 2),
  (Seq("Mona"), 0)
).toDF("names", "sentiment")

// Explode the array so each name gets its own row, then average per name
val avgDF1 = avgDF.withColumn("name", explode(col("names")))
val avgResultDF = avgDF1.groupBy("name").agg(avg(col("sentiment")))

avgResultDF.show(false)
//      +-----+--------------+
//      |name |avg(sentiment)|
//      +-----+--------------+
//      |Lily |10.0          |
//      |Kerry|6.0           |
//      |Mona |4.0           |
//      +-----+--------------+
1 vote

Simply explode the array, then group by name and take the average.

PySpark equivalent:

import pyspark.sql.functions as f

# df is the question's DataFrame with columns `names` (array) and `sentiment`
df1 = df.select(f.explode('names').alias('name'), 'sentiment')

df1.groupBy('name').agg(f.avg('sentiment').alias('sentiment')).show()
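
For reference, a minimal self-contained sketch that builds the sample data from the question and runs the same aggregation; the SparkSession setup (a session named spark) is an assumption, not part of the original answer.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Assumed setup: reuse or create a local SparkSession
spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [(["Lily", "Kerry", "Mona"], 10),
     (["Kerry", "Mona"], 2),
     (["Mona"], 0)],
    ["names", "sentiment"],
)

# Explode the array so each name gets its own row, then average per name
result = (
    df.select(f.explode("names").alias("name"), "sentiment")
      .groupBy("name")
      .agg(f.avg("sentiment").alias("sentiment"))
)

result.show()
# Expected rows (order may vary): Lily 10.0, Kerry 6.0, Mona 4.0

Note that explode drops rows whose names array is null or empty; explode_outer keeps them if that matters for your data.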