Joining the columns of two RDDs in Apache Spark

Question

This question has been asked before, but I could not understand the answers properly.

I have two RDDs with the same number of columns and the same number of records:

RDD1(col1,col2,col3)

and

RDD2(colA,colB,colC)

I need to join them as follows:

RDD_FINAL(col1,col2,col3,colA,colB,colC)

There is no key to join the records on, but they are in the same order: the first record of RDD1 corresponds to the first record of RDD2.

An RDD doesn't preserve the internal order of its rows. Frankly, your question is too broad to answer. Please put some effort into reviewing your question! – eliasah
Also, please read how to ask a question on SO: stackoverflow.com/help/how-to-ask – eliasah
@eliasah Thank you for the response and the guidance. These two RDDs come from two different text files, and the only thing I need is to combine all the columns. One reason is to compare col1 and colA. I wanted to try the join operation, but my data set does not have any identifying key like a primary key in SQL. – Rouzbeh Zarandi

Alfilercio · Accepted Answer · 2017-01-08T15:26:53

You can use the zipWithIndex method to add each row's index as a key to both RDDs, and then join them on that key.
