
I have two dataframes that I need to join on one column, keeping only the rows from the first dataframe whose id appears in the same column of the second dataframe:

df1:

  id    a     b
  2     1     1
  3    0.5    1
  4     1     2
  5     2     1

df2:

 id      c    d
  2      fs   a
  5      fa   f

Desired output:

df:
  id   a   b
   2   1   1
   5   2   1

I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.


2 Answers

3 votes

df2("id") is not valid Python syntax for selecting columns; you'd either need df2[["id"]] or use select: df2.select("id"). For your example, you can do:

df1.join(df2.select("id"), "id").show()

+---+---+---+
| id|  a|  b|
+---+---+---+
|  5|2.0|  1|
|  2|1.0|  1|
+---+---+---+

or:

df1.join(df2[["id"]], "id").show()
+---+---+---+
| id|  a|  b|
+---+---+---+
|  5|2.0|  1|
|  2|1.0|  1|
+---+---+---+
2 votes

If you only need to check whether id exists in df2 and don't need any columns from df2 in your output, then isin() is a more efficient solution (this is similar to EXISTS and IN in SQL).

df1 = spark.createDataFrame([(2,1,1), (3,5,1), (4,1,2), (5,2,1)], "id: int, a: int, b: int")

df2 = spark.createDataFrame([(2,'fs','a') ,(5,'fa','f')], ['id','c','d'])

Create a list of the df2.id values and pass it to df1 inside isin():

from pyspark.sql.functions import col

df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()

df1.where(col('id').isin(df2_list)).show()

#+---+---+---+
#| id|  a|  b|
#+---+---+---+
#|  2|  1|  1|
#|  5|  2|  1|
#+---+---+---+

It is recommended to use isin() if:

  • You don't need to return data from the reference dataframe/table

  • You have duplicates in the reference dataframe/table (a JOIN can produce duplicate rows if values are repeated)

  • You just want to check the existence of a particular value