I have two DataFrames, df1 and df2. df1 has 70 rows and 7 columns, and df2 has 80 rows and 7 columns.

How can I fetch only those records from df2 that are new with respect to df1, i.e. records whose values do not exist in df1, in PySpark 2.2.0?

I tried this left-join approach, but I am unable to execute it with sqlContext.sql():

# df1 and df2 must be registered as temporary views for sqlContext.sql to resolve them by name
sqlContext.sql("""
    SELECT df2.*, df1.*
    FROM df2
    LEFT JOIN df1
      ON df2.col1 = df1.col1
     AND df2.col2 = df1.col2
     AND df2.col3 = df1.col3
     AND df2.col4 = df1.col4
     AND df2.col5 = df1.col5
     AND df2.col6 = df1.col6
     AND df2.col7 = df1.col7
    WHERE df1.col1 IS NULL
      AND df1.col2 IS NULL
      AND df1.col3 IS NULL
      AND df1.col4 IS NULL
      AND df1.col5 IS NULL
      AND df1.col6 IS NULL
      AND df1.col7 IS NULL
""").show()
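
For reference, the same condition can also be expressed with the DataFrame API as a left anti join, which is available in Spark 2.x. A rough sketch, assuming both DataFrames use the column names col1 through col7:

# Sketch: a left anti join keeps only the df2 rows that have no match in df1
# on all seven columns (assumes both DataFrames share these column names).
cols = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
new_records = df2.join(df1, on=cols, how='left_anti')
new_records.show()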

1 Answer


Use the DataFrame method subtract [1]. Example:

l1 = [(1,2),(3,4),(5,6), (12,537)]
l2 = [(1,2),(3,4),(5,6), (7,8), (9,10), (12,537)]
df1 = spark.createDataFrame(l1,['col1','col2'])
df2 = spark.createDataFrame(l2,['col1','col2'])

result = df2.subtract(df1)
result.show()
+----+----+
|col1|col2|
+----+----+
|   7|   8|
|   9|  10|
+----+----+
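
Note that subtract compares whole rows across all columns, deduplicates the result, and requires both DataFrames to have the same schema. If you prefer SQL, the same set difference can be written with EXCEPT once the DataFrames are registered as temporary views; a rough sketch, assuming identical schemas:

# Sketch: the same set difference expressed through Spark SQL
# (assumes df1 and df2 have identical schemas).
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("SELECT * FROM df2 EXCEPT SELECT * FROM df1").show()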

[1] https://spark.apache.org/docs/1.3.0/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.subtract