Joining two RDDs when keys are not in the same place

Question

I have two RDDs. Elements of RDD1 look like [123, 456, 789], and the tuples in RDD2 look like [456, 999]. I need to join these two RDDs on 456, which is the second element in RDD1 and the first element in RDD2, so that the final output looks something like [123, 456, 789, 999]. Can this be done, or do the keys need to be in the first position for the join? Thanks in advance for your time.

So RDD1 is made of 3-element tuples and RDD2 is made of 2-element tuples? — rogue-one
Yes, that's correct. I need to combine these two RDDs into 4-element tuples and then reduce the joined RDD by the last element, which is 999 in this case. — Digvijay Sawant

rogue-one · Accepted Answer · 2017-02-25T18:18:26

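You could convert both RDDs to DataFrames and then do a simple join, as shown below. A DataFrame join condition can reference any column, so the key does not have to be in the first position.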

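# Assumes the pyspark shell, where sc and a SparkSession already exist (toDF needs one)
rdd1 = sc.parallelize([(123, 456, 789)])
rdd2 = sc.parallelize([(456, 999)])
# toDF() without arguments names the columns _1, _2, ...
df1 = rdd1.toDF()
df2 = rdd2.toDF()
# Join the 2nd column of df1 against the 1st column of df2
result = df1.join(df2, df1['_2'] == df2['_1'])
# Joined rows have 5 columns (3 from df1, 2 from df2); skip x[3], the duplicate key
result.rdd.map(lambda x: (x[0], x[1], x[2], x[4])).collect()
# Output: [(123, 456, 789, 999)]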
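
If you would rather stay with plain RDDs: RDD.join does work on (key, value) pairs, so the key has to be moved to the first position before joining. Here is a minimal sketch of that approach. The helper names key_left, flatten and join_on_second are my own for illustration, not part of the Spark API, and the sketch assumes rdd1 holds 3-tuples and rdd2 holds 2-tuples as in the question.

```python
def key_left(t):
    # (a, k, c) -> (k, (a, c)): move the 2nd element into key position
    a, k, c = t
    return (k, (a, c))


def flatten(kv):
    # (k, ((a, c), d)) -> (a, k, c, d): rebuild a flat 4-tuple after the join
    k, ((a, c), d) = kv
    return (a, k, c, d)


def join_on_second(rdd1, rdd2):
    # rdd2 is already in (key, value) form, so only rdd1 needs re-keying.
    # join() on pair RDDs yields (key, (left_value, right_value)).
    return rdd1.map(key_left).join(rdd2).map(flatten)
```

With the rdd1 and rdd2 defined above, join_on_second(rdd1, rdd2).collect() should give the same 4-tuples as the DataFrame version.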
