I have a dataframe as following:
+----+--------+--------+------+
| id | value1 | value2 | flag |
+----+--------+--------+------+
| 1 | 7000 | 30 | 0 |
| 2 | 0 | 9 | 0 |
| 3 | 23627 | 17 | 1 |
| 4 | 8373 | 23 | 0 |
| 5 | -0.5 | 4 | 1 |
+----+--------+--------+------+
I want to run following conditions-
1. If value is greater than 0, I want previous rows value2
2. If value is equal to 0, I want the average of previous row and next row's value2
3. If value is less than 0, then NULL
So I wrote the following code-
df = df.withColumn('value2',when(col(value1)>0,lag(col(value2))).when(col(value1)==0,\
(lag(col(value2))+lead(col(value2)))/2.0).otherwise(None))
What I want is that I should have the updated value when I am taking the previous and next rows' value, like following. It should go in an order of finding them, first for id-1, update it, then for id-2 take the updated value and so on.
+----+--------+--------+------+
| id | value1 | value2 | flag |
+----+--------+--------+------+
| 1 | 7000 | null | 0 |
| 2 | 0 | 8.5 | 0 |
| 3 | 23627 | 8.5 | 1 |
| 4 | 8373 | 8.5 | 0 |
| 5 | -0.5 | null | 1 |
+----+--------+--------+------+
I tried by just giving the id==1 in when,reassign dataframe and then again perform withcolumn,when operations.
df = df.withColumn('value2',when((col(id)==1)&(col(value1)>0,lag(col(value2)))
\.when((col(id)==1)&col(value1)==0,(lag(col(value2))+lead(col(value2)))/2.0)\
.when((col(id)==1)&col(col(value1)<0,None).otherwise(col(value2))
After this I'll get the updated column value and if I do the same operation again for id==2, I can get the updated value. But I certainly cannot do that for every id. How can I achieve this?
id==1? - pvy4917