I am new to PySpark. I have a PySpark DataFrame and I want to drop duplicates based on the id and timestamp columns, then replace the reading value with null for any (id, timestamp) pair that had duplicates. I do not want to use Pandas. Please see below:
Dataframe:
id  reading  timestamp
1   13015    2018-03-22 08:00:00.000
1   14550    2018-03-22 09:00:00.000
1   14570    2018-03-22 09:00:00.000
2   15700    2018-03-22 08:00:00.000
2   16700    2018-03-22 09:00:00.000
2   18000    2018-03-22 10:00:00.000
Desired output:
id  reading  timestamp
1   13015    2018-03-22 08:00:00.000
1   Null     2018-03-22 09:00:00.000
2   15700    2018-03-22 08:00:00.000
2   16700    2018-03-22 09:00:00.000
2   18000    2018-03-22 10:00:00.000
What do I need to add to this code:
df.dropDuplicates(['id','timestamp'])
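For context, the direction I have been trying to piece together is roughly the sketch below (untested, and dup_count is just a scratch column name I made up): count the rows per (id, timestamp) with a window, null out reading where that count is greater than 1, then drop the duplicates.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, 13015, "2018-03-22 08:00:00.000"),
        (1, 14550, "2018-03-22 09:00:00.000"),
        (1, 14570, "2018-03-22 09:00:00.000"),
        (2, 15700, "2018-03-22 08:00:00.000"),
        (2, 16700, "2018-03-22 09:00:00.000"),
        (2, 18000, "2018-03-22 10:00:00.000"),
    ],
    ["id", "reading", "timestamp"],
)

# Count how many rows share each (id, timestamp) pair.
w = Window.partitionBy("id", "timestamp")

result = (
    df.withColumn("dup_count", F.count("*").over(w))
      # Null out reading wherever the pair occurred more than once.
      .withColumn(
          "reading",
          F.when(F.col("dup_count") > 1, F.lit(None)).otherwise(F.col("reading")),
      )
      .drop("dup_count")
      # Then keep a single row per (id, timestamp).
      .dropDuplicates(["id", "timestamp"])
)

result.show(truncate=False)

Is this on the right track, or is there a more idiomatic way to do it?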
Any help would be much appreciated. Many thanks!