I have input data as below: input1 as {col1:"val1",col2:"val2",col3:"val3",.....} and input2 as acctno^^email_id. I am doing a left outer join on these two datasets, and the final output should be {col1:"val1",col2:"val2",col3:"val3",col4:email_id}. Please find below the code snippet that I have done so far.
DataFrame DF1 = sqlCtx.jsonRDD(JSONRDD1);
DF1.registerTempTable("DCP");
DataFrame DF2 = sqlCtx.read().json(inputPath1);
DF2.registerTempTable("IDN");
String joinSQL = "SELECT i.col1, i.col2, i.col3, d.email_id FROM IDN i LEFT OUTER JOIN DCP d ON i.col1 = d.acctno";
DataFrame joinedDF = sqlCtx.sql(joinSQL);
joinedDF.repartition(1).toJSON().saveAsTextFile("outputpath");
But the final output has duplicate records, which are not needed. I want to remove them. To remove the duplicate records I have tried distinct() and dropDuplicates() on the joinedDF, but neither is able to remove the duplicates and the output still has duplicate records.
Input1 has some 4897 records and input2 has some 2198765 records. The final output should have 4897 records, but in my case it is coming out as 5101 records. I am new to Spark programming using Java. Kindly help me out to solve the above duplicate record issue.
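The jump from 4897 to 5101 rows is the usual symptom of the right-hand table (DCP) containing more than one row per acctno: a left outer join emits one output row per matching right-hand row, and distinct() cannot collapse those rows when they differ in email_id. A minimal plain-Java sketch of that mechanism (hypothetical keys and emails; in Spark the analogous key-based fix would be dropDuplicates on the join column, e.g. joinedDF.dropDuplicates(new String[]{"col1"})):

```java
import java.util.*;

public class JoinDedupSketch {
    // In-memory left outer join: a left key that matches N right-hand rows
    // yields N output rows, which is where the extra records come from.
    static List<String[]> leftOuterJoin(List<String> left, List<String[]> right) {
        List<String[]> out = new ArrayList<>();
        for (String key : left) {
            boolean matched = false;
            for (String[] r : right) {
                if (r[0].equals(key)) {
                    out.add(new String[]{key, r[1]});
                    matched = true;
                }
            }
            if (!matched) out.add(new String[]{key, null}); // keep unmatched left rows
        }
        return out;
    }

    // Keep one row per join key (analogous to Spark's dropDuplicates on the key
    // column); distinct() on whole rows cannot do this when non-key columns differ.
    static Collection<String[]> dedupByKey(List<String[]> rows) {
        Map<String, String[]> byKey = new LinkedHashMap<>();
        for (String[] r : rows) byKey.putIfAbsent(r[0], r);
        return byKey.values();
    }

    public static void main(String[] args) {
        // Hypothetical right-hand data with a duplicated acctno "A1".
        List<String[]> dcp = List.of(
            new String[]{"A1", "a@x.com"},
            new String[]{"A1", "a@y.com"},  // second row for the same acctno
            new String[]{"A2", "b@x.com"});
        List<String> idn = List.of("A1", "A2");

        List<String[]> joined = leftOuterJoin(idn, dcp);
        System.out.println("joined rows: " + joined.size());              // 3, not 2
        System.out.println("deduped rows: " + dedupByKey(joined).size()); // 2
    }
}
```

If you need a specific email_id per account rather than an arbitrary one, the right-hand dataset itself should be deduplicated (or aggregated) on acctno before the join.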
SELECT DISTINCT in the query? - Gordon Linoff