
I have the following dataframe:

+---+--------------------------------------+
| id|                                 score|
+---+--------------------------------------+
|  1|(math, 90)(physics, 87)(chemistry, 82)|
|  2|                        (computer, 84)|
|  3|                                  null|
+---+--------------------------------------+

in which the column score is of string type. I want to convert this string into array&lt;string&gt;, like below:

+---+--------------------------------------------+
| id|                                       score|
+---+--------------------------------------------+
|  1|['math, 90', 'physics, 87', 'chemistry, 82']|
|  2|                            ['computer, 84']|
|  3|                                        null|
+---+--------------------------------------------+

So basically it splits the string into an array and removes all the parentheses.

I'm referencing this question, but I'm not sure what the correct regular expression to use is.
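For reference, here is the kind of split I have in mind, sketched in plain Python (the parenthesis-stripping regexes are just my guess at what's needed; I don't know how to translate this to Spark):

```python
import re

s = "(math, 90)(physics, 87)(chemistry, 82)"

# Strip the leading "(" and the trailing ")", then split on every ")(" boundary.
inner = re.sub(r"^\(|\)$", "", s)
parts = re.split(r"\)\(", inner)
print(parts)  # ['math, 90', 'physics, 87', 'chemistry, 82']
```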

Thanks, and I appreciate any help.

1 Answer


This should give you the desired output using only Spark SQL functions.

Doing it purely with built-in functions is a bit convoluted; a UDF may read more cleanly, but I'm not sure about the performance impact.

from pyspark.sql import functions as F

testDF = spark.createDataFrame([
    (1, "(math, 90)(physics, 87)(chemistry, 82)"),
    (2, "(computer, 84)"),
    (3, None)  # a real null, matching the question's dataframe
], ["id", "score"])

# Split on each ")(" boundary, explode to one row per element, strip the
# remaining parentheses, then collect the pieces back into an array per id.
# explode_outer keeps the null row (plain explode would drop id 3); note that
# collect_list skips nulls, so id 3 comes back as an empty array rather than null.
testDF.select('id', F.split('score', '[)][(]').alias('score'))\
    .select('id', F.explode_outer('score').alias('score'))\
    .select('id', F.regexp_replace('score', '[()]', '').alias('score'))\
    .groupBy('id')\
    .agg(F.collect_list("score").alias("score"))\
    .orderBy('id')\
    .show(truncate=False)

Note that show() does not print quotes around the array elements; you can see them if you run collect() instead of show().
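For what it's worth, here is a sketch of the UDF route mentioned above. The parsing itself is plain Python; the parse_score name and the Spark wiring (shown commented out, since it assumes a live SparkSession) are hypothetical:

```python
import re

def parse_score(s):
    # Null in, null out -- mirrors how the Spark built-ins propagate null.
    if s is None:
        return None
    # Capture the text inside each "(...)" group.
    return re.findall(r"\(([^)]*)\)", s)

# Hypothetical Spark wiring, assuming a SparkSession and the testDF above:
# from pyspark.sql import functions as F
# from pyspark.sql.types import ArrayType, StringType
# parse_score_udf = F.udf(parse_score, ArrayType(StringType()))
# testDF.withColumn("score", parse_score_udf("score")).show(truncate=False)
```

Unlike the explode/collect_list pipeline, this keeps null rows as null and avoids the groupBy shuffle, though UDFs bypass Catalyst optimizations.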