I am currently migrating my scripts from pandas to PySpark. I got inconsistent results from a simple groupBy in pandas and PySpark, and I am confused. I would appreciate it if someone could help me.

My panel data set looks like this:

RIC     date     hour   minute     volume
VOD      01-01    9       11        55    
VOD      01-01    9       11        55
VOD      01-01    10      12        79
VOD      01-01    10      13        55
VOD      01-01    10      15        245  
VOD      01-01    11      14        356
VOD      01-02    11      15        6798
...       ...     ...     ...       ...
BAT      01-01    9       11        556   
BAT      01-02    9       12        552   
BAT      01-02    9       14        551   
...       ...     ...     ....     ...  

In pandas, I used the following code to get the total volume for each stock in each minute

    df = pd.read_csv(r'/home/user/stock.csv', parse_dates=[1])
    df_volume = df.groupby(['RIC', 'date', 'hour', 'minute']).sum().reset_index()
    df_volume.head(5)

Then I got the correct output

RIC     date     hour   minute     volume
VOD      01-01    9       11        110    
VOD      01-01    10      12        79
VOD      01-01    10      13        55
...       ...     ...     ....     ...  

However, when I coded it in Spark, I used the following

    df = spark.read.format('csv').option('header', 'true').load('/home/user/stock.csv')
    df.printSchema()
    root
      |-- RIC: string (nullable = true)
      |-- date: date (nullable = true)  
      |-- hour: float (nullable = true)
      |-- minute: float (nullable = true)  
      |-- volume: float (nullable = true)
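
For printSchema() to report typed columns like these, the CSV read presumably used schema inference via .option('inferSchema', 'true') or an explicit schema; with only the header option, every column loads as a string. A minimal sketch of the explicit-schema variant, with field types mirroring the output above:

    from pyspark.sql.types import StructType, StructField, StringType, DateType, FloatType

    # field types mirror the printSchema() output above
    schema = StructType([
        StructField('RIC', StringType(), True),
        StructField('date', DateType(), True),
        StructField('hour', FloatType(), True),
        StructField('minute', FloatType(), True),
        StructField('volume', FloatType(), True),
    ])

    # read with the explicit schema instead of relying on inference
    df = spark.read.format('csv').option('header', 'true').schema(schema).load('/home/user/stock.csv')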

Then I coded

    from pyspark.sql.functions import countDistinct, avg, stddev, count, sum
    df_volume = df.groupBy(['RIC', 'date', 'hour', 'minute']).agg(sum('volume').alias('volume'))
    df_volume = df_volume.orderBy(['RIC', 'date', 'hour', 'minute'], ascending=[True, True, True, True])
    df_volume.show()

Then I got the incorrect output

+---+-----+----+------+------+
|RIC| date|hour|minute|volume|
+---+-----+----+------+------+
|VOD|01-02|  10|    13|   355|
|VOD|01-03|  14|    03|   357|
|VOD|01-05|  15|    45|   683|
+---+-----+----+------+------+
...

This time, a few observations are missing from the Spark output. I think I coded everything correctly. Can someone help? Thanks.

1 Answer

I can help you with this, but first I need to know that you have:

- done a count to check the number of rows in both DataFrames;
- checked the entire DataFrame for the missing values, for example by filtering on one of the apparently missing rows to confirm whether it is in the DataFrame at all, rather than just checking head().
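
A minimal sketch of both checks, assuming the pandas result is named df_volume_pd and the Spark result df_volume as in the question (the filter values are just placeholders):

    # Check 1: compare row counts of the two aggregated results
    print(len(df_volume_pd))  # pandas row count
    print(df_volume.count())  # Spark row count

    # Check 2: filter the whole Spark DataFrame for a row that looks
    # missing from the output, instead of relying on show()/head()
    df_volume.filter(
        (df_volume.RIC == 'VOD') & (df_volume.date == '01-01')
    ).show()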