I have two RDDs. Elements of RDD1 look like [123, 456, 789], and tuples in RDD2 look like [456, 999]. I need to join these two RDDs on 456, which is the second element in RDD1 and the first element in RDD2. The final output should look something like [123, 456, 789, 999]. Is there a way to do this, or do the keys need to be in the first position for the join? Thanks in advance for your time.
1 Answer
You could convert the RDDs to DataFrames and then do a simple join on the matching columns, as shown below.
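rdd1 = sc.parallelize([(123, 456, 789)])
rdd2 = sc.parallelize([(456, 999)])
# toDF() with no arguments names the columns _1, _2, _3, ...
df1 = rdd1.toDF()
df2 = rdd2.toDF()
# Join the 2nd column of df1 against the 1st column of df2
result = df1.join(df2, df1['_2'] == df2['_1'])
# Joined row is (df1._1, df1._2, df1._3, df2._1, df2._2); skip the duplicate key at index 3
result.rdd.map(lambda x: (x[0], x[1], x[2], x[4])).collect()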
[(123, 456, 789, 999)]
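On the second part of the question: a plain RDD join does work on (key, value) pairs, so the key has to come first, but you can rekey each RDD on the fly rather than changing your data. Below is a minimal sketch of that approach, assuming the key is always at index 1 in RDD1 and index 0 in RDD2. The helper names `flatten_joined`, `join_on_positions`, and `demo` are made up for illustration and are not part of Spark.

```python
def flatten_joined(pair):
    """Turn (key, (left_row, right_row)) into one flat tuple.

    The key already appears in left_row, so the first element of
    right_row (the duplicate key) is dropped.
    """
    _key, (left, right) = pair
    return tuple(left) + tuple(right[1:])


def join_on_positions(rdd1, rdd2):
    """Join rdd1 (key at index 1) with rdd2 (key at index 0)."""
    keyed1 = rdd1.keyBy(lambda row: row[1])  # e.g. (456, (123, 456, 789))
    keyed2 = rdd2.keyBy(lambda row: row[0])  # e.g. (456, (456, 999))
    # join() yields (key, (row_from_rdd1, row_from_rdd2)) for matching keys
    return keyed1.join(keyed2).map(flatten_joined)


def demo():
    # Assumes PySpark is installed and a local Spark context can be created.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd1 = sc.parallelize([(123, 456, 789)])
    rdd2 = sc.parallelize([(456, 999)])
    print(join_on_positions(rdd1, rdd2).collect())
    sc.stop()
```

Calling `demo()` in a PySpark environment runs the join end to end. This is an inner join, so rows without a matching key in the other RDD are dropped, the same as the DataFrame join above.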