1
votes

I have 2 RDDs. RDD1 elements look like [123, 456, 789] and RDD2 tuples look like [456, 999]. I need to combine/join these 2 RDDs on 456, which is the 2nd element in RDD1 and the first element in RDD2. The final output should look something like [123, 456, 789, 999]. Is there a way to do this, or do the keys need to be in the first position for the join? Thanks in advance for your time.

1
So RDD1 is made of tuples of 3 elements and RDD2 is made of tuples of 2 elements? - rogue-one
Yes, that's correct. I need to combine these 2 RDDs into tuples of 4 elements and then reduce my final joined RDD based on the last element, which is 999 in this case. - Digvijay Sawant

1 Answer

0
votes

You could convert the RDDs to DataFrames and then do a simple join on the matching columns, as shown below.

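# Assumes a PySpark shell, where `sc` and an active SparkSession already exist.
rdd1 = sc.parallelize([(123, 456, 789)])
rdd2 = sc.parallelize([(456, 999)])
# toDF() without arguments names the columns _1, _2, ...
df1 = rdd1.toDF()
df2 = rdd2.toDF()
# Join the 2nd column of df1 against the 1st column of df2.
result = df1.join(df2, df1['_2'] == df2['_1'])
# A joined row is (_1, _2, _3, _1, _2); index 3 repeats the key, so skip it.
result.rdd.map(lambda x: (x[0], x[1], x[2], x[4])).collect()
# [(123, 456, 789, 999)]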
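
On the question of whether the key has to come first: if you would rather stay with plain RDDs, `RDD.join` works on (key, value) pairs, so the key does need to be the first element of a 2-tuple, but you can reshape RDD1 with a `map` just before joining. Below is a minimal sketch of that idea. The helper names (`key_by_second`, `flatten`, `join_on_second`) are made up for illustration, and it assumes RDD2 already holds (key, value) 2-tuples as in the question.

```python
def key_by_second(row):
    """(a, k, c) -> (k, (a, k, c)): put the join key in front, keep the full row as the value."""
    return (row[1], row)


def flatten(pair):
    """(k, ((a, k, c), d)) -> (a, k, c, d): unpack the joined pair into one flat tuple."""
    _key, (left_row, right_value) = pair
    return left_row + (right_value,)


def join_on_second(rdd1, rdd2):
    """Join 3-tuples in rdd1 with (key, value) pairs in rdd2 on rdd1's 2nd element.

    rdd1 and rdd2 are expected to be PySpark RDDs; rdd2 is already keyed,
    so only rdd1 has to be reshaped before calling join().
    """
    return rdd1.map(key_by_second).join(rdd2).map(flatten)
```

In a PySpark shell, `join_on_second(sc.parallelize([(123, 456, 789)]), sc.parallelize([(456, 999)])).collect()` should give `[(123, 456, 789, 999)]`. Note that this is an inner join, so rows in RDD1 without a matching key in RDD2 are dropped, same as in the DataFrame version.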