
I have a dataframe like this:

name | scores
Dan  |  [1_10, 2_5, 3_2, 4_12.5]
Ann  |  [2_12.4, 3_4.5, 5_9.3]
Jon  |  [2_1.7]

For each row, I want to extract the numeric value from each item in the scores column (split the item on the underscore and take index 1) and sum those values. Each item in the array is a string.

My expected output looks like this:

name | Total
Dan  |  29.5
Ann  |  26.2
Jon  |  1.7 

My dataframe is huge; the array column can contain millions of items in the worst case. An explode-based solution is not working out for me because of the huge size of the dataframe after the explode.

My driver is small and I can't afford to run a UDF to solve this.

Can RDD or map help here? If so, how do I use it efficiently? I'm running PySpark 2.3, btw.
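
For reference, here's a minimal snippet to reproduce the sample data (a sketch; it assumes the scores column is array<string>, since that's what the table above suggests):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above; scores is assumed to be array<string>
df = spark.createDataFrame(
    [
        ("Dan", ["1_10", "2_5", "3_2", "4_12.5"]),
        ("Ann", ["2_12.4", "3_4.5", "5_9.3"]),
        ("Jon", ["2_1.7"]),
    ],
    ["name", "scores"],
)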

2 Answers

3 votes

Here's another way without using explode. Get the maximum size of the scores array column, then, using a list comprehension, sum the elements (the extracted float values) of the array with Python's built-in sum function:

from pyspark.sql import functions as F

# Size of the longest scores array in the dataframe
max_size = df.select(F.max(F.size("scores"))).first()[0]

df1 = df.withColumn(
    "Total",
    sum([
        # element i is null for rows with shorter arrays, so default it to 0
        F.coalesce(F.split(F.col("scores")[i], "_")[1].cast("double"), F.lit(0.0))
        for i in range(max_size)
    ])
)

df1.show(truncate=False)

#+----+------------------------+-----+
#|name|scores                  |Total|
#+----+------------------------+-----+
#|Dan |[1_10, 2_5, 3_2, 4_12.5]|29.5 |
#|Ann |[2_12.4, 3_4.5, 5_9.3]  |26.2 |
#|Jon |[2_1.7]                 |1.7  |
#+----+------------------------+-----+

For Spark 2.4+, it's better to use the transform and aggregate functions, as pointed out in @mck's answer.

2 votes

You can use the aggregate and transform higher-order functions (Spark 2.4+):

df2 = df.selectExpr('name', """
    aggregate(
        transform(scores, x -> split(x, '_')[1]),
        double(0),
        (acc, x) -> acc + x
    ) as Total
""")

For older Spark versions, try explode and groupBy:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'scores', F.explode('scores')  # one row per array element
).withColumn(
    'scores', F.split('scores', '_')[1].cast('double')
).groupBy('name').agg(F.sum('scores').alias('Total'))

An RDD solution:

from pyspark.sql import Row

df2 = df.rdd.map(
    lambda r: Row(name=r.name, Total=sum([float(i.split('_')[1]) for i in r.scores]))
).toDF()

df2.show()
+----+-----+
|name|Total|
+----+-----+
| Dan| 29.5|
| Ann| 26.2|
| Jon|  1.7|
+----+-----+