There may be another way to solve this, but I would prefer to do it with PySpark 2.3.
I have a two-dimensional PySpark DataFrame like this:
Date | ID
---------- | ----
08/31/2018 | 10
09/31/2018 | 10
09/01/2018 | null
09/01/2018 | null
09/01/2018 | 12
I want to replace the null values in the ID column with the closest non-null value in the past; if no such value exists, with the closest non-null value in the future; and if that is also null, with a default value.
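To pin down exactly what I mean, here is the rule in plain Python over an ordered list of IDs (this is not Spark code; `fill_ids` and the default of 0 are just my own illustration):

```python
def fill_ids(ids, default=0):
    """For each None, take the closest non-None value before it;
    failing that, the closest non-None value after it; else default."""
    filled = []
    for i, v in enumerate(ids):
        if v is not None:
            filled.append(v)
            continue
        # Closest non-None value before position i
        back = next((x for x in reversed(ids[:i]) if x is not None), None)
        # Closest non-None value after position i
        fwd = next((x for x in ids[i + 1:] if x is not None), None)
        filled.append(back if back is not None
                      else fwd if fwd is not None
                      else default)
    return filled
```

So for the table above, `fill_ids([10, 10, None, None, 12])` should give `[10, 10, 10, 10, 12]`.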
My idea was to add a new column with .withColumn
and a UDF that queries the data frame itself.
Something like this in pseudo code (not perfect, but it shows the main idea):
    from pyspark.sql.types import StringType
    from pyspark.sql.functions import udf

    def return_value(value, date):
        # Keep the value if it is already set
        if value is not None:
            return value
        # Otherwise look for the closest value in the past ...
        value1 = df.filter(df['date'] <= date).select(df['value']).collect()
        if value1[0][0] is not None:
            return value1[0][0]
        # ... and fall back to the closest value in the future
        value2 = df.filter(df['date'] >= date).select(df['value']).collect()
        return value2[0][0]

    value_udf = udf(return_value, StringType())
    new_df = df.withColumn("new_value", value_udf(df.value, df.date))
But it does not work. Am I going about this completely the wrong way? Is it even possible to query a Spark data frame from inside a UDF? Did I miss an easier solution?