I would like to implement below requirement using Spark dataframes to compare 2 text/csv
- List item
files. Ideally, File1.txt should compare with File2.txt and result should be in other txt file with flag as (SAME/UPDATE/INSERT/DELETE).
UPDATE - if any record values are updated in file2 when compared to file1 INSERT - if a new record exist in file2 DELETE - only if the record exist in file1 (not in file2) SAME - if same record exist in both files
File1.txt
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 600
3 HR GOPI 1500
5 HW MAHI 700
File2.txt
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 900
4 MT SUMP 1200
5 HW MAHI 700
Outputfile.txt
NO DEPT NAME SAL FLAG
1 IT RAM 1000 S
2 IT SRI 900 U
4 MT SUMP 1200 I
5 HW MAHI 700 S
3 HR GOPI 1500 D
So far, i did below coding. But not able to proceed further. Pls help.
from pyspark.shell import spark
sc = spark.sparkContext
df1 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\inputs\\file1.csv")
df2 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\inputs\\file2.csv")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
sqlDF1 = spark.sql( "select * from table1" )
sqlDF2 = spark.sql( "select * from table2" )
leftJoinDF = sqlDF1.join(sqlDF2, 'id', how='left')
rightJoinDF = sqlDF1.join(sqlDF2, 'id', how='right')
innerJoinDF = sqlDF1.join(sqlDF2, 'id')
Is there any way if we merge the data, after performing leftJoin, rightJoin, innerJoin. With this whether i could get desired output or any other way.
Thanks,