I am currently migrating my scripts from pandas to PySpark. I am getting inconsistent results from a simple groupBy in pandas and PySpark, and I am confused. I would appreciate it if someone could help me.
My panel data set looks like this:
RIC  date   hour  minute  volume
VOD  01-01  9     11      55
VOD  01-01  9     11      55
VOD  01-01  10    12      79
VOD  01-01  10    13      55
VOD  01-01  10    15      245
VOD  01-01  11    14      356
VOD  01-02  11    15      6798
...  ...    ...   ...     ...
BAT  01-01  9     11      556
BAT  01-02  9     12      552
BAT  01-02  9     14      551
...  ...    ...   ...     ...
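In case anyone wants to reproduce this, a tiny stand-in for the real file can be built like this (made-up values matching the layout above; the path is just a placeholder):
import pandas as pd

# Minimal stand-in for the real CSV: same columns, a few made-up rows
sample = pd.DataFrame({
    'RIC':    ['VOD', 'VOD', 'VOD', 'BAT'],
    'date':   ['01-01', '01-01', '01-01', '01-01'],
    'hour':   [9, 9, 10, 9],
    'minute': [11, 11, 12, 11],
    'volume': [55, 55, 79, 556],
})
sample.to_csv('/home/user/stock.csv', index=False)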
In pandas, I used the following code to get the total volume for each stock in each minute:
import pandas as pd

df = pd.read_csv(r'/home/user/stock.csv', parse_dates=[1])  # column 1 is the date column
df_volume = df.groupby(['RIC', 'date', 'hour', 'minute']).sum().reset_index()
df_volume.head(5)
Then I got the correct output; the two duplicate VOD 9:11 rows are summed into a single row with volume 110:
RIC  date   hour  minute  volume
VOD  01-01  9     11      110
VOD  01-01  10    12      79
VOD  01-01  10    13      55
...  ...    ...   ...     ...
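For what it's worth, a more explicit form of the same pandas aggregation (summing only the volume column) gives me the same result:
# Same aggregation, but explicit about which column is summed
df_volume = df.groupby(['RIC', 'date', 'hour', 'minute'], as_index=False)['volume'].sum()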
However, when I coded it in Spark, I used the following:
df = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').load('/home/user/stock.csv')
df.printSchema()
root
 |-- RIC: string (nullable = true)
 |-- date: date (nullable = true)
 |-- hour: float (nullable = true)
 |-- minute: float (nullable = true)
 |-- volume: float (nullable = true)
Then I ran:
from pyspark.sql.functions import countDistinct, avg, stddev, count, sum

# Total volume per stock per minute
df_volume = df.groupBy(['RIC', 'date', 'hour', 'minute']).agg(sum('volume'))
df_volume.orderBy(['RIC', 'date', 'hour', 'minute'], ascending=[True, True, True, True])
df_volume.show()
Then I got this incorrect output:
+---+-----+----+------+------+
|RIC| date|hour|minute|volume|
+---+-----+----+------+------+
|VOD|01-02|  10|    13|   355|
|VOD|01-03|  14|    03|   357|
|VOD|01-05|  15|    45|   683|
|...|  ...| ...|   ...|   ...|
+---+-----+----+------+------+
This time, a few observations seem to be missing from the Spark output. I think I coded everything right.
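In case it helps, here is a sanity check I could run to tell whether rows are genuinely missing from the aggregate or just not being displayed (a sketch, using df and df_volume as defined above):
# One output row is expected per distinct (RIC, date, hour, minute) group
n_groups = df.select('RIC', 'date', 'hour', 'minute').distinct().count()
n_rows = df_volume.count()
print(n_groups, n_rows)

# show() prints only the first 20 rows by default; display more, sorted
df_volume.orderBy('RIC', 'date', 'hour', 'minute').show(50, truncate=False)
Can someone help? Thanks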