In my case how to split a column contain StringType with a format '1-1235.0 2-1248.0 3-7895.2' to another column with ArrayType contains ['1','2','3']
2 Answers
1
votes
this is relatively simple with UDF:
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))
df.withColumn("newCol", extractFirst($"input"))
.show()
gives
+--------------------+---------+
| input| newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+
I could not find an easy soluton with spark internals (other than using split
in combination with explode
etc and then re-aggregating)
1
votes
You can split the string to an array using split
function and then you can transform the array using Higher Order Function TRANSFORM
(it is available since Sark 2.4) together with substring_index
:
import org.apache.spark.sql.functions.{split, expr}
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")
df.withColumn("array", split($"stringCol", " "))
.withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))
Notice that this is native approach, no UDF applied.