How can I do the equivalent of pandas' df.fillna(method='bfill') on a pyspark.sql.DataFrame?
The pyspark dataframe has the pyspark.sql.DataFrame.fillna method, but it does not support a method parameter.
In pandas you can use the following to backfill a time series:
Create data
import pandas as pd
index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]
df = pd.DataFrame({'data': data}, index=index)
Giving
Out[1]:
            data
2017-01-01   1.0
2017-01-02   2.0
2017-01-03   3.0
2017-01-04   NaN
2017-01-05   5.0
Backfill the dataframe
df = df.fillna(method='bfill')
Produces the backfilled frame
Out[2]:
            data
2017-01-01   1.0
2017-01-02   2.0
2017-01-03   3.0
2017-01-04   5.0
2017-01-05   5.0
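To be explicit about the behaviour I am after, here is a minimal pure-Python sketch of what backfilling means: each missing value is replaced by the next non-missing value that follows it. The helper name backfill is hypothetical and only for illustration; it is not part of pandas or pyspark.

```python
def backfill(values):
    """Replace each None with the next non-None value that follows it."""
    filled = []
    next_valid = None  # nearest non-null value found so far, scanning from the end
    for value in reversed(values):
        if value is not None:
            next_valid = value
        filled.append(next_valid)
    filled.reverse()  # restore the original order
    return filled


data = [1, 2, 3, None, 5]
result = backfill(data)
print(result)  # [1, 2, 3, 5, 5]
```

A trailing missing value has nothing after it to copy from, so it stays missing in this sketch.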
How can the same thing be done for a pyspark.sql.DataFrame?