how to join two differents dataframes in pyspark?

Question

I have two dataframes one with columns of X year,month and measure, and
columns with x1, x2 which corresonds to the first day and the second day . The first dataframe is:

X year month                   measure                   X1        X2
1     1 2014    12          Max.TemperatureF             64        42
2     2 2014    12         Mean.TemperatureF             52        38
3     3 2014    12          Min.TemperatureF             39        33

The second dataframe where only I have the days.

         X3   X4      X5    X6    X7
         1    51      43    42    45    
         2    44      37    34    42    
         3    37      30    26    38

So I want to join the two dataframes and obtain in pyspark

 X year month                 measure          X1   X2  X3  X4  X5  X6  
'1     1 2014    12       Max.TemperatureF     64   42  1   51  43  42     
'2     2 2014    12       Mean.TemperatureF    52   38  2   44  37  34  
'3     3 2014    12       Min.TemperatureF     39   33  3   37  30  26

I have joined them but they get one dataframe above the another dataframe instead of that they remain in the same rows

  from functools import reduce 
  from pyspark.sql import DataFrame


  def unionAll(*dfs):
  return reduce(DataFrame.unionAll, dfs)

  td = unionAll(*[weather1, weather2])

        X year month             measure               X1        X2
      1     1 2014    12          Max.TemperatureF     64        42
      2     2 2014    12         Mean.TemperatureF     52        38
      3     3 2014    12          Min.TemperatureF     39        33
            X3      X4    X5    X6    
       1    51      43    42    45    
       2    44      37    34    42    
       3    37      30    26    38

So this is a wrong joining .

Rob Rob · Accepted Answer · 2019-07-14T23:29:59

I suppose what you are trying to do is join two tables. To join two tables, you need a common column and since you don't have a common column, you will have to create something. This is how I would tackle this:

# Copy the entire 'X' column (which I am assuming is the index)
weather2 = weather2.withColumn('X', weather1['X'])

# Join the two tables on 'X'
joinExpr = 'X'
td = weather1.join(weather2, joinExpr)

This should solve the problem.

how to join two differents dataframes in pyspark?

1 Answers