I have input data as below: input1 as {col1:"val1",col2:"val2",col3:"val3",.....} and input2 as acctno^^email_id. I am doing a left outer join on these two datasets, and the final output should be {col1:"val1",col2:"val2",col3:"val3",col4:email_id}. Please find below the code snippet that I have done so far.
DataFrame DF1 = sqlCtx.jsonRDD(JSONRDD1);
DF1.registerTempTable("DCP");
DataFrame DF2 = sqlCtx.read().json(inputPath1);
DF2.registerTempTable("IDN");
String joinSQL = "SELECT i.col1, i.col2, i.col3, d.email_id FROM IDN i LEFT OUTER JOIN DCP d ON i.col1 = d.acctno";
DataFrame joinedDF = sqlCtx.sql(joinSQL);
joinedDF.repartition(1).toJSON().saveAsTextFile("outputpath");
But the final output has duplicate records, which are not needed. I want to remove them. To remove the duplicate records I have tried distinct() and dropDuplicates() on the joinedDF, but neither is able to remove the duplicates and the output still has duplicate records.
Input1 has some 4897 records and input2 has some 2198765 records. The final output should have 4897 records, but in my case it is coming out as 5101 records. I am new to Spark programming using Java. Kindly help me out to solve the above duplicate record issue.
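The jump from 4897 to 5101 rows is the usual symptom of the right-hand table (DCP) containing more than one row per acctno: a left outer join emits one output row per matching right-hand row, and distinct() cannot collapse those rows when they differ in email_id. A minimal plain-Java sketch of that mechanism (hypothetical keys and emails; in Spark the analogous key-based fix would be dropDuplicates on the join column, e.g. joinedDF.dropDuplicates(new String[]{"col1"})):

```java
import java.util.*;

public class JoinDedupSketch {
    // In-memory left outer join: a left key that matches N right-hand rows
    // yields N output rows, which is where the extra records come from.
    static List<String[]> leftOuterJoin(List<String> left, List<String[]> right) {
        List<String[]> out = new ArrayList<>();
        for (String key : left) {
            boolean matched = false;
            for (String[] r : right) {
                if (r[0].equals(key)) {
                    out.add(new String[]{key, r[1]});
                    matched = true;
                }
            }
            if (!matched) out.add(new String[]{key, null}); // keep unmatched left rows
        }
        return out;
    }

    // Keep one row per join key (analogous to Spark's dropDuplicates on the key
    // column); distinct() on whole rows cannot do this when non-key columns differ.
    static Collection<String[]> dedupByKey(List<String[]> rows) {
        Map<String, String[]> byKey = new LinkedHashMap<>();
        for (String[] r : rows) byKey.putIfAbsent(r[0], r);
        return byKey.values();
    }

    public static void main(String[] args) {
        // Hypothetical right-hand data with a duplicated acctno "A1".
        List<String[]> dcp = List.of(
            new String[]{"A1", "a@x.com"},
            new String[]{"A1", "a@y.com"},  // second row for the same acctno
            new String[]{"A2", "b@x.com"});
        List<String> idn = List.of("A1", "A2");

        List<String[]> joined = leftOuterJoin(idn, dcp);
        System.out.println("joined rows: " + joined.size());              // 3, not 2
        System.out.println("deduped rows: " + dedupByKey(joined).size()); // 2
    }
}
```

If you need a specific email_id per account rather than an arbitrary one, the right-hand dataset itself should be deduplicated (or aggregated) on acctno before the join.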
SELECT DISTINCT in the query? - Gordon Linoff