
I have a dataframe like this:

name | scores
Dan  |  [1_10, 2_5, 3_2, 4_12.5]
Ann  |  [2_12.4, 3_4.5, 5_9.3]
Jon  |  [2_1.7]

For each row, I want to extract the numeric value from each item in the scores column (split the item on the underscore and take index 1) and sum those values. Each item in the array is a string.

My expected output looks like this:

name | Total
Dan  |  29.5
Ann  |  26.2
Jon  |  1.7 

My dataframe is huge; the array column can contain millions of items in the worst case. An explode-based solution is not working out for me because of the huge size of the dataframe after the explode.

My driver is small and I can't afford to run a UDF to solve this.

Can RDD or map help here? If so, how do I use it efficiently? I'm running PySpark 2.3, btw.
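
For reference, here's a minimal snippet to reproduce the sample data (a sketch; it assumes the scores column is array<string>, since that's what the table above suggests):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above; scores is assumed to be array<string>
df = spark.createDataFrame(
    [
        ("Dan", ["1_10", "2_5", "3_2", "4_12.5"]),
        ("Ann", ["2_12.4", "3_4.5", "5_9.3"]),
        ("Jon", ["2_1.7"]),
    ],
    ["name", "scores"],
)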

2 Answers

3 votes

Here's another way without using explode. Get the maximum size of the scores array column, then, using a list comprehension, sum the elements (the extracted float values) of the array with Python's built-in sum function:

from pyspark.sql import functions as F

# Size of the longest scores array in the dataframe
max_size = df.select(F.max(F.size("scores"))).first()[0]

df1 = df.withColumn(
    "Total",
    sum([
        # element i is null for rows with shorter arrays, so default it to 0
        F.coalesce(F.split(F.col("scores")[i], "_")[1].cast("double"), F.lit(0.0))
        for i in range(max_size)
    ])
)

df1.show(truncate=False)

#+----+------------------------+-----+
#|name|scores                  |Total|
#+----+------------------------+-----+
#|Dan |[1_10, 2_5, 3_2, 4_12.5]|29.5 |
#|Ann |[2_12.4, 3_4.5, 5_9.3]  |26.2 |
#|Jon |[2_1.7]                 |1.7  |
#+----+------------------------+-----+

For Spark 2.4+, it's better to use the transform and aggregate functions, as pointed out in @mck's answer.

2 votes

You can use the aggregate and transform higher-order functions (Spark 2.4+):

df2 = df.selectExpr('name', """
    aggregate(
        transform(scores, x -> split(x, '_')[1]),
        double(0),
        (acc, x) -> acc + x
    ) as Total
""")

For older Spark versions, try explode and groupBy:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'scores', F.explode('scores')  # one row per array element
).withColumn(
    'scores', F.split('scores', '_')[1].cast('double')
).groupBy('name').agg(F.sum('scores').alias('Total'))

An RDD solution:

from pyspark.sql import Row

df2 = df.rdd.map(
    lambda r: Row(name=r.name, Total=sum([float(i.split('_')[1]) for i in r.scores]))
).toDF()

df2.show()
+----+-----+
|name|Total|
+----+-----+
| Dan| 29.5|
| Ann| 26.2|
| Jon|  1.7|
+----+-----+