
I have a dataframe as follows:

+----+--------+--------+------+
| id | value1 | value2 | flag |
+----+--------+--------+------+
|  1 | 7000   | 30     |   0  |
|  2 | 0      | 9      |   0  |
|  3 | 23627  | 17     |   1  |
|  4 | 8373   | 23     |   0  |
|  5 | -0.5   | 4      |   1  |
+----+--------+--------+------+

I want to apply the following conditions:
1. If value1 is greater than 0, I want the previous row's value2.
2. If value1 is equal to 0, I want the average of the previous and next rows' value2.
3. If value1 is less than 0, I want NULL.
So I wrote the following code:

w = Window.orderBy('id')
df = df.withColumn('value2', when(col('value1') > 0, lag(col('value2')).over(w))
                   .when(col('value1') == 0, (lag(col('value2')).over(w) + lead(col('value2')).over(w)) / 2.0)
                   .otherwise(None))

What I want is for the previous and next rows' values to be the already-updated values, as shown below. It should process the rows in order: first compute the value for id 1 and update it, then for id 2 use the updated value, and so on. For example, id 3 should get 8.5 because it takes the updated value2 of id 2 (8.5), not the original 9.

+----+--------+--------+------+
| id | value1 | value2 | flag |
+----+--------+--------+------+
|  1 | 7000   | null   |   0  |
|  2 | 0      | 8.5    |   0  |
|  3 | 23627  | 8.5    |   1  |
|  4 | 8373   | 8.5    |   0  |
|  5 | -0.5   | null   |   1  |
+----+--------+--------+------+

I tried restricting the when to id==1, reassigning the dataframe, and then performing the withColumn/when operations again.

df = df.withColumn('value2', when((col('id') == 1) & (col('value1') > 0), lag(col('value2')).over(w))
                   .when((col('id') == 1) & (col('value1') == 0), (lag(col('value2')).over(w) + lead(col('value2')).over(w)) / 2.0)
                   .when((col('id') == 1) & (col('value1') < 0), None)
                   .otherwise(col('value2')))

After this I'll get the updated column value, and if I do the same operation again for id==2, I can get its updated value. But I certainly cannot do that for every id. How can I achieve this?

Where did you try adding id==1? - pvy4917
@karma4917 edited, please take a look - Visualisation App
Did you try putting that in some loop? - pvy4917
If I have a huge dataset, a loop is an inefficient way, right? - Visualisation App
Can you please add the expected output? - Ali Yesilli

2 Answers

from pyspark.sql import SparkSession    
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import Window


spark = SparkSession \
    .builder \
    .appName('test') \
    .getOrCreate()


# sample data taken from the question's table
tab_inp = [(1, 7000.0, 30, 0), (2, 0.0, 9, 0), (3, 23627.0, 17, 1),
           (4, 8373.0, 23, 0), (5, -0.5, 4, 1)]
tab_data = spark.sparkContext.parallelize(tab_inp)
schema = StructType([StructField('id',IntegerType(),True),
                     StructField('value1',FloatType(),True),
                     StructField('value2',IntegerType(),True),
                     StructField('flag',IntegerType(),True)
                    ])

table = spark.createDataFrame(tab_data,schema)
table.createOrReplaceTempView("table")
# constant column so the whole frame falls into a single window partition
dummy_df = table.withColumn('dummy', lit('dummy'))
# previous and next value2, ordered by id
pre_value = dummy_df.withColumn('pre_value', lag(dummy_df['value2']).over(Window.partitionBy('dummy').orderBy('id')))
cmb_value = pre_value.withColumn('next_value', lead(dummy_df['value2']).over(Window.partitionBy('dummy').orderBy('id')))

new_column = when(col('value1') > 0, cmb_value.pre_value) \
            .when(col('value1') < 0, lit(None)) \
            .otherwise((cmb_value.pre_value + cmb_value.next_value) / 2)


final_table = cmb_value.withColumn('value', new_column)

Above "final_table" will have field you are expecting.


I think it will be complicated to do this entirely without looping. But you could split the data into subsets, handle each subset in pandas via a grouped-map UDF, and distribute those subsets across executors. For this to work there have to be enough break points (i.e., data points where value1 is less than 0 and you are inserting a NULL), because the rows between two break points must still be processed sequentially within one group.

Imports:

from pyspark.sql import Window
from pyspark.sql.functions import last
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
import pandas as pd
import numpy as np
from pyspark.sql.functions import col, lit, when

Input data:

df = spark.createDataFrame([[1, 7000.0, 30.0], [2, 0.0, 9.0], [3, 23627.0, 17.0],
                            [4, 8373.0, 23.0], [5, -0.5, 4.0]],
                           ['id', 'value1', 'value2']).cache()

Adding the next row's value2 and setting break points wherever value1 is smaller than 0:

# self-join on id+1 to attach the next row's value2
dfwithnextvalue = df.alias("a").join(df.alias("b"), col("a.id") == col("b.id") - lit(1), 'left')\
    .select("a.*", col("b.value2").alias("nextvalue"))
# start a new group at id 1 and wherever value1 < 0
dfstartnew = dfwithnextvalue.withColumn("startnew", when(col("value1") < lit(0), col("id")).otherwise(lit(None)))\
.withColumn("startnew", when(col("id") == lit(1), lit(1)).otherwise(col("startnew")))
# forward-fill the group id down the frame
window = Window.orderBy('id')
rolled = last(col('startnew'), ignorenulls=True).over(window)
dfstartnewrolled = dfstartnew.withColumn("startnew", rolled)
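For the sample rows this should tag ids 1 through 4 with startnew = 1 and id 5 with startnew = 5, since id 5 is the only break point. A quick way to check, using the names defined above:

dfstartnewrolled.orderBy('id').show()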

Now we can group by the startnew column and handle every piece in pandas. My pandas knowledge is not great, but this seems to work:

@pandas_udf("id long, value1 double, value2 double", PandasUDFType.GROUPED_MAP)
def loopdata(df):
  df = df.set_index('id').sort_index()
  for i in range(0, len(df.index)):
    if i == 0:
      df.loc[df.index[0], 'value2'] = np.nan
    elif df.loc[df.index[i], 'value1'] < 0:
      df.loc[df.index[i], 'value2'] = np.nan
    elif df.loc[df.index[i], 'value1'] > 0:
      df.loc[df.index[i], 'value2'] = df.loc[df.index[i-1], 'value2']
    else:
      nextvalue = df.loc[df.index[i], 'nextvalue']
      if pd.isna(nextvalue):
        nextvalue = 0
      prevvalue = df.loc[df.index[i-1], 'value2']
      if pd.isna(prevvalue):
        prevvalue = 0
      df.loc[df.index[i], 'value2'] = (nextvalue + prevvalue)/2.0
  df = df.drop(columns=['nextvalue', 'startnew'])
  df = df.reset_index()
  return df

Now you can compute the result:

dfstartnewrolled.groupBy("startnew").apply(loopdata)
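For example, to compute and inspect the output (names as defined in this answer):

result = dfstartnewrolled.groupBy("startnew").apply(loopdata)
result.orderBy("id").show()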