
How can I split a column of StringType with the format '1-1235.0 2-1248.0 3-7895.2' into another column of ArrayType containing ['1', '2', '3']?


2 Answers


This is relatively simple with a UDF:

import org.apache.spark.sql.functions.udf
import spark.implicits._ // for toDF and the $"..." column syntax

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")

// For each space-separated "k-v" pair, keep the part before the dash as an Int
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))

df.withColumn("newCol", extractFirst($"input"))
  .show()

gives

+--------------------+---------+
|               input|   newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+

I could not find an easy solution using only Spark's built-in functions (other than using split in combination with explode etc. and then re-aggregating).
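
For completeness, here is a minimal sketch of that built-in-only route, assuming the df defined above. The column names "id", "piece", and "prefix" are my own choices, and monotonically_increasing_id is used only to get a key to re-group on; note that collect_list gives no ordering guarantee after a shuffle, which is one reason a UDF is more convenient here:

import org.apache.spark.sql.functions.{split, explode, substring_index, collect_list, monotonically_increasing_id}
// assumes spark.implicits._ is imported, as in the snippet above

// Tag each row so the exploded pieces can be re-grouped afterwards
val withId = df.withColumn("id", monotonically_increasing_id())

withId
  .withColumn("piece", explode(split($"input", " ")))      // one row per "k-v" pair
  .withColumn("prefix", substring_index($"piece", "-", 1)) // keep the part before '-'
  .groupBy("id", "input")
  .agg(collect_list($"prefix").as("newCol"))               // re-assemble into an array
  .show()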


You can split the string into an array using the split function and then transform the array using the higher-order function TRANSFORM (available since Spark 2.4) together with substring_index:

import org.apache.spark.sql.functions.{split, expr}
import spark.implicits._ // for toDF and the $"..." column syntax

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")

df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))")) // array of strings: ["1", "2", "3"]

Notice that this is a native approach; no UDF is applied.
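
If you are on Spark 3.0 or later, the same higher-order function is also exposed directly in the Scala DSL as functions.transform, so the expr() string can be avoided; a minimal sketch, reusing the df above:

import org.apache.spark.sql.functions.{split, substring_index, transform}

df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", transform($"array", x => substring_index(x, "-", 1))) // array of strings: ["1", "2", "3"]
  .show(false)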