
I have two dataframes that I need to join on one column, keeping only the rows from the first dataframe whose id appears in the same column of the second dataframe:

df1:

  id    a     b
  2     1     1
  3    0.5    1
  4     1     2
  5     2     1

df2:

 id      c    d
  2      fs   a
  5      fa   f

Desired output:

df:
  id   a   b
   2   1   1
   5   2   1

I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.


2 Answers

3 votes

df2("id") is not valid Python syntax for selecting columns; you'd either need df2[["id"]] or use select: df2.select("id"). For your example, you can do:

df1.join(df2.select("id"), "id").show()

+---+---+---+
| id|  a|  b|
+---+---+---+
|  5|2.0|  1|
|  2|1.0|  1|
+---+---+---+

or:

df1.join(df2[["id"]], "id").show()
+---+---+---+
| id|  a|  b|
+---+---+---+
|  5|2.0|  1|
|  2|1.0|  1|
+---+---+---+
2 votes

If you only need to check whether id exists in df2 and don't need any columns from df2 in your output, then isin() is a more efficient solution (this is similar to EXISTS and IN in SQL).

df1 = spark.createDataFrame([(2,1,1), (3,5,1), (4,1,2), (5,2,1)], "id: int, a: int, b: int")

df2 = spark.createDataFrame([(2,'fs','a') ,(5,'fa','f')], ['id','c','d'])

Create a list of the df2.id values and pass it to df1 inside isin():

from pyspark.sql.functions import col

df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()

df1.where(col('id').isin(df2_list)).show()

#+---+---+---+
#| id|  a|  b|
#+---+---+---+
#|  2|  1|  1|
#|  5|  2|  1|
#+---+---+---+

It is recommended to use isin() if:

  • You don't need to return data from the reference dataframe/table

  • You have duplicates in the reference dataframe/table (a JOIN can produce duplicate rows if values are repeated)

  • You just want to check the existence of a particular value