1 vote

I have a big PySpark DataFrame (23M rows) in the following format:

names, sentiment
["Lily","Kerry","Mona"], 10
["Kerry", "Mona"], 2
["Mona"], 0

I would like to compute the average sentiment for each unique name in the names column, resulting in:

name, sentiment
"Lily", 10
"Kerry", 6
"Mona", 4
Have you tried a head-on approach of exploding the array, grouping by name and taking the average? Is the dataframe too big for that? – mazaneicha
I did not, sorry; that actually worked well. – Ferran

2 Answers

1 vote
import org.apache.spark.sql.functions.{avg, col, explode}
import spark.implicits._  // assumes a SparkSession named `spark` is in scope (e.g. spark-shell)

// Sample data matching the question
val avgDF = Seq(
  (Seq("Lily", "Kerry", "Mona"), 10),
  (Seq("Kerry", "Mona"), 2),
  (Seq("Mona"), 0)
).toDF("names", "sentiment")

// Explode the array so each name gets its own row, then average per name
val avgDF1 = avgDF.withColumn("name", explode(col("names")))
val avgResultDF = avgDF1.groupBy("name").agg(avg(col("sentiment")))

avgResultDF.show(false)
//      +-----+--------------+
//      |name |avg(sentiment)|
//      +-----+--------------+
//      |Lily |10.0          |
//      |Kerry|6.0           |
//      |Mona |4.0           |
//      +-----+--------------+
1 vote

Simply explode the array, then group by name and take the average.

PySpark equivalent:

import pyspark.sql.functions as f

# df is the question's DataFrame with columns `names` (array) and `sentiment`
df1 = df.select(f.explode('names').alias('name'), 'sentiment')

df1.groupBy('name').agg(f.avg('sentiment').alias('sentiment')).show()
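
For reference, a minimal self-contained sketch that builds the sample data from the question and runs the same aggregation; the SparkSession setup (a session named spark) is an assumption, not part of the original answer.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Assumed setup: reuse or create a local SparkSession
spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [(["Lily", "Kerry", "Mona"], 10),
     (["Kerry", "Mona"], 2),
     (["Mona"], 0)],
    ["names", "sentiment"],
)

# Explode the array so each name gets its own row, then average per name
result = (
    df.select(f.explode("names").alias("name"), "sentiment")
      .groupBy("name")
      .agg(f.avg("sentiment").alias("sentiment"))
)

result.show()
# Expected rows (order may vary): Lily 10.0, Kerry 6.0, Mona 4.0

Note that explode drops rows whose names array is null or empty; explode_outer keeps them if that matters for your data.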