0
votes

I have a DataFrame with the following schema

root
 |-- col_a: string (nullable = false)
 |-- col_b: string (nullable = false)
 |-- col_c_a: string (nullable = false)
 |-- col_c_b: string (nullable = false)
 |-- col_d: string (nullable = false)
 |-- col_e: string (nullable = false)
 |-- col_f: string (nullable = false)

now I want to convert the Schema for this data frame to something like this.

root
 |-- col_a: string (nullable = false)
 |-- col_b: string (nullable = false)
 |-- col_c: struct (nullable = false)
     |-- col_c_a: string (nullable = false)
     |-- col_c_b: string (nullable = false)
 |-- col_d: string (nullable = false)
 |-- col_e: string (nullable = false)
 |-- col_f: string (nullable = false)

I can able to do this with the help of map transformation by explicitly fetching the value of each column from row type but this is very complex process and does not look good So,

is there any way I can achieve this?

Thanks

1

1 Answers

1
votes

There is an in-built struct function with the definition :

def struct(cols: Column*): Column

You can use it like :

df.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  2|  3|
+---+---+

df.withColumn("struct_col", struct($"a", $"b")).show
+---+---+----------+
|  a|  b|struct_col|
+---+---+----------+
|  1|  2|     [1,2]|
|  2|  3|     [2,3]|
+---+---+----------+

The schema of the new dataframe being :

 |-- a: integer (nullable = false)
 |-- b: integer (nullable = false)
 |-- struct_col: struct (nullable = false)
 |    |-- a: integer (nullable = false)
 |    |-- b: integer (nullable = false)

In you case, you can do something like :

df.withColumn("col_c" , struct($"col_c_a", $"col_c_b") ).drop($"col_c_a").drop($"col_c_b")