I have a DataFrame with a schema like this:
|-- order: string (nullable = true)
|-- travel: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- place: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- address: string (nullable = true)
| | | |-- latitude: double (nullable = true)
| | | |-- longitude: double (nullable = true)
| | |-- distance_in_kms: float (nullable = true)
| | |-- estimated_time: struct (nullable = true)
| | | |-- seconds: long (nullable = true)
| | | |-- nanos: integer (nullable = true)
I want to take the seconds field of estimated_time, convert it to a string, append "s" to it, and replace estimated_time with the resulting string. For example, { "seconds": 988, "nanos": 102 } would be converted to "988s", so the schema would change to:
|-- order: string (nullable = true)
|-- travel: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- place: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- address: string (nullable = true)
| | | |-- latitude: double (nullable = true)
| | | |-- longitude: double (nullable = true)
| | |-- distance_in_kms: float (nullable = true)
| | |-- estimated_time: string (nullable = true)
How can I do this in PySpark?
As a more concrete example, I want to transform this DataFrame (shown as JSON):
{
  "order": "c-331",
  "travel": [
    {
      "place": {
        "name": "A place",
        "address": "The address",
        "latitude": 0.0,
        "longitude": 0.0
      },
      "distance_in_kms": 1.0,
      "estimated_time": {
        "seconds": 988,
        "nanos": 102
      }
    }
  ]
}
into
{
  "order": "c-331",
  "travel": [
    {
      "place": {
        "name": "A place",
        "address": "The address",
        "latitude": 0.0,
        "longitude": 0.0
      },
      "distance_in_kms": 1.0,
      "estimated_time": "988s"
    }
  ]
}
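For reference, one possible direction is sketched below. It is only a rough sketch, not a confirmed solution: it assumes Spark 2.4 or later (for the SQL transform() higher-order function), and the helper names travel_expr and replace_estimated_time are made up for illustration. The idea is to rebuild each element of travel with named_struct, keeping place and distance_in_kms as they are and swapping estimated_time for the seconds value cast to a string with "s" appended.

```python
# Hypothetical helpers for illustration; assumes Spark 2.4+ (SQL transform()).

def travel_expr(array_col="travel"):
    """Build a Spark SQL expression that rebuilds every array element,
    replacing the estimated_time struct with the string '<seconds>s'."""
    return (
        f"transform({array_col}, x -> named_struct("
        "'place', x.place, "
        "'distance_in_kms', x.distance_in_kms, "
        "'estimated_time', concat(cast(x.estimated_time.seconds AS string), 's')"
        "))"
    )


def replace_estimated_time(df, array_col="travel"):
    """Return df with the array column rewritten; other columns are kept.

    Column names are backtick-quoted because `order` is a SQL keyword.
    The rewritten array column ends up last in the result.
    """
    others = [f"`{c}`" for c in df.columns if c != array_col]
    return df.selectExpr(*others, f"{travel_expr(array_col)} AS `{array_col}`")
```

Calling replace_estimated_time(df).printSchema() on the DataFrame above should show estimated_time as a string, although the nullable flags may not match the target schema exactly, since named_struct builds a non-null struct. On Spark 3.1 or later, pyspark.sql.functions.transform combined with Column.withField may be a shorter way to replace a single field without listing the others.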