2
votes

I have a column col1 that represents a GPS coordinate format:

25 4.1866N 55 8.3824E

I would like to split it in multiple columns based on white-space as separator, as in the output example table_example below:

| 1st_split  | 2nd_split   | 3rd_split    | 4th_split    |
|:-----------|------------:|:------------:|:------------:|
|    25      |   4.1866N   |     55       |     8.3824E  |

Considering the fact that there is the split() function, I have tried in this way:

SELECT explode(split(`col1`, ' ')) AS `col` FROM table_example;

But, instead of splitting per multiple columns, it splits per multiple rows, like in the output below:

output

Can someone clarify me which would be the worth approach for get the expected result?

1

1 Answers

3
votes

If you have a dataframe as

+---------------------+
|col                  |
+---------------------+
|25 4.1866N 55 8.3824E|
+---------------------+

Using Scala API

You can simply use split inbuilt function and select appropriately as

import org.apache.spark.sql.functions._
df.withColumn("split", split(col("col"), " "))
    .select(col("split")(0).as("1st_split"), col("split")(1).as("2nd_split"),col("split")(2).as("3rd_split"),col("split")(3).as("4th_split"))
  .show(false)

which would give you

+---------+---------+---------+---------+
|1st_split|2nd_split|3rd_split|4th_split|
+---------+---------+---------+---------+
|25       |4.1866N  |55       |8.3824E  |
+---------+---------+---------+---------+

Using SQL way

Sql is much easier and similar to the api way

df.createOrReplaceTempView("table_example")
val splitted = sqlContext.sql("SELECT split(`col`, ' ') AS `col` FROM table_example")

splitted.createOrReplaceTempView("splitted_table")
val result = sqlContext.sql("SELECT `col`[0] AS `1st_split`, `col`[1] AS `2nd_split`, `col`[2] AS `3rd_split`, `col`[3] AS `4th_split` FROM splitted_table")
result.show(false)

I hope the answer is helpful