I am currently migrating my scripts from pandas to PySpark. I am getting inconsistent results from a simple groupBy in pandas and PySpark, and I am confused. I would appreciate it if someone could help me.
My panel data set looks like this:
RIC  date   hour  minute  volume
VOD  01-01  9     11      55
VOD  01-01  9     11      55
VOD  01-01  10    12      79
VOD  01-01  10    13      55
VOD  01-01  10    15      245
VOD  01-01  11    14      356
VOD  01-02  11    15      6798
...  ...    ...   ...     ...
BAT  01-01  9     11      556
BAT  01-02  9     12      552
BAT  01-02  9     14      551
...  ...    ...   ...     ...
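In case anyone wants to reproduce this, a tiny stand-in for the real file can be built like this (made-up values matching the layout above; the path is just a placeholder):
import pandas as pd

# Minimal stand-in for the real CSV: same columns, a few made-up rows
sample = pd.DataFrame({
    'RIC':    ['VOD', 'VOD', 'VOD', 'BAT'],
    'date':   ['01-01', '01-01', '01-01', '01-01'],
    'hour':   [9, 9, 10, 9],
    'minute': [11, 11, 12, 11],
    'volume': [55, 55, 79, 556],
})
sample.to_csv('/home/user/stock.csv', index=False)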
In pandas, I used the following code to get the total volume for each stock in each minute:
import pandas as pd

df = pd.read_csv(r'/home/user/stock.csv', parse_dates=[1])  # column 1 is the date column
df_volume = df.groupby(['RIC', 'date', 'hour', 'minute']).sum().reset_index()
df_volume.head(5)
Then I got the correct output; the two duplicate VOD 9:11 rows are summed into a single row with volume 110:
RIC  date   hour  minute  volume
VOD  01-01  9     11      110
VOD  01-01  10    12      79
VOD  01-01  10    13      55
...  ...    ...   ...     ...
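For what it's worth, a more explicit form of the same pandas aggregation (summing only the volume column) gives me the same result:
# Same aggregation, but explicit about which column is summed
df_volume = df.groupby(['RIC', 'date', 'hour', 'minute'], as_index=False)['volume'].sum()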
However, when I coded it in Spark, I used the following:
df = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').load('/home/user/stock.csv')
df.printSchema()
root
 |-- RIC: string (nullable = true)
 |-- date: date (nullable = true)
 |-- hour: float (nullable = true)
 |-- minute: float (nullable = true)
 |-- volume: float (nullable = true)
Then I ran:
from pyspark.sql.functions import countDistinct, avg, stddev, count, sum

# Total volume per stock per minute
df_volume = df.groupBy(['RIC', 'date', 'hour', 'minute']).agg(sum('volume'))
df_volume.orderBy(['RIC', 'date', 'hour', 'minute'], ascending=[True, True, True, True])
df_volume.show()
Then I got this incorrect output:
+---+-----+----+------+------+
|RIC| date|hour|minute|volume|
+---+-----+----+------+------+
|VOD|01-02|  10|    13|   355|
|VOD|01-03|  14|    03|   357|
|VOD|01-05|  15|    45|   683|
|...|  ...| ...|   ...|   ...|
+---+-----+----+------+------+
This time, a few observations seem to be missing from the Spark output. I think I coded everything right.
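In case it helps, here is a sanity check I could run to tell whether rows are genuinely missing from the aggregate or just not being displayed (a sketch, using df and df_volume as defined above):
# One output row is expected per distinct (RIC, date, hour, minute) group
n_groups = df.select('RIC', 'date', 'hour', 'minute').distinct().count()
n_rows = df_volume.count()
print(n_groups, n_rows)

# show() prints only the first 20 rows by default; display more, sorted
df_volume.orderBy('RIC', 'date', 'hour', 'minute').show(50, truncate=False)
Can someone help? Thanks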