
I'm using Pyspark 2.4 and would like to create df_2 from df_1:

df_1:

root
 |-- request: array (nullable = false)
 |    |-- address: struct (nullable = false)
 |    |    |-- street: string (nullable = false)
 |    |    |-- postcode: string (nullable = false)

df_2:

root
 |-- request: array (nullable = false)
 |    |-- address: struct (nullable = false)
 |    |    |-- street: string (nullable = false)

I know a UDF is one way, but are there other ways, such as map(), to achieve the same goal?


1 Answer


Use the transform function:

from pyspark.sql.functions import expr

df_2 = df_1.withColumn("request", expr("transform(request, x -> struct(x.street as street))"))

For each element of the request array, we select only the street field and build a new struct, so postcode is dropped while the array structure is preserved.
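To see what the lambda inside transform is doing, here is a plain-Python analogue (no Spark needed): each array element is treated as a dict, and the sample street/postcode values are made up for illustration.

```python
# Analogue of transform(request, x -> struct(x.street as street)):
# for each address struct in the array, keep only the street field.
request = [
    {"street": "10 Downing St", "postcode": "SW1A 2AA"},
    {"street": "221B Baker St", "postcode": "NW1 6XE"},
]
trimmed = [{"street": x["street"]} for x in request]
print(trimmed)
# [{'street': '10 Downing St'}, {'street': '221B Baker St'}]
```

transform does the same thing column-wise inside Spark SQL, without deserializing the rows into Python, which is why it is usually faster than an equivalent UDF.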