I have a PySpark dataframe with three columns. The first two columns have arrays as their elements, while the last column gives the length of the arrays in the second column. Following is the PySpark dataframe:
+---------------------+---------------------+-----+
|                   c1|                   c2|lenc2|
+---------------------+---------------------+-----+
|[2017-02-14 00:00:00]|[2017-02-24 00:00:00]|    1|
|[2017-01-16 00:00:00]|                   []|    0|
+---------------------+---------------------+-----+
The arrays contain timestamps. The column lenc2 denotes the length of the array in column c2. For all rows where lenc2==0, the column c1 has only one (timestamp) element.
For all rows where lenc2==0, I want to take the timestamp from the array in column c1, add 5 days to it and put it inside the array in column c2. How can I do this?
This is an example of the expected output:
+---------------------+---------------------+-----+
|                   c1|                   c2|lenc2|
+---------------------+---------------------+-----+
|[2017-02-14 00:00:00]|[2017-02-24 00:00:00]|    1|
|[2017-01-16 00:00:00]|[2017-01-21 00:00:00]|    0|
+---------------------+---------------------+-----+
Below is what I have tried till now:
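from pyspark.sql import functions as F

# Attempt: where c2 is empty, merge c1 into it (this copies the c1 timestamp as-is, without adding 5 days)
df2 = df1.withColumn(
    "c2",
    F.when(F.col("lenc2") == 0, F.array_union(F.col("c1"), F.col("c2"))).otherwise(
        F.col("c2")
    ),
)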