I have two RDDs.

rdd1 = (String, String)

key1, value11
key2, value12
key3, value13

rdd2 = (String, String)

key2, value22
key3, value23
key4, value24

I need to form another RDD with merged rows from rdd1 and rdd2. The output should look like:

key2, value12 ; value22
key3, value13 ; value23

So basically, this means taking the intersection of the keys of rdd1 and rdd2 and then joining their values. Note: the values must be in order, i.e. value(rdd1) followed by value(rdd2), not the reverse.

2 Answers

I think this may be what you are looking for:

join(otherDataset, [numTasks])  

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

See the associated section of the docs
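
To make the shape of the result concrete, here is a minimal sketch in plain Python that models what an inner join on key/value pairs produces for the sample data in the question. The helper `inner_join` is hypothetical and only imitates the semantics locally; it is not Spark's implementation. The point is that each joined value is a `(V, W)` pair with the left dataset's value first, so the requested order value(rdd1) then value(rdd2) falls out naturally.

```python
from collections import defaultdict

def inner_join(left, right):
    """Hypothetical local model of an inner join on (key, value) pairs.

    Returns (k, (v, w)) for every key present in both inputs,
    with the left value first and the right value second.
    """
    right_by_key = defaultdict(list)
    for k, w in right:
        right_by_key[k].append(w)
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]

# Sample data from the question, as plain lists instead of RDDs.
rdd1 = [("key1", "value11"), ("key2", "value12"), ("key3", "value13")]
rdd2 = [("key2", "value22"), ("key3", "value23"), ("key4", "value24")]

# Step 1: join on key; only key2 and key3 appear on both sides.
joined = inner_join(rdd1, rdd2)

# Step 2: flatten each (v, w) pair into the "v ; w" string from the question.
merged = [(k, v + " ; " + w) for k, (v, w) in joined]

# With real PySpark RDDs, the same two steps would be:
#   rdd1.join(rdd2).mapValues(lambda vw: vw[0] + " ; " + vw[1])
```

Here `merged` holds `("key2", "value12 ; value22")` and `("key3", "value13 ; value23")`. In Spark itself the order of the output rows is not guaranteed, but the order inside each pair is always (value from the left RDD, value from the right RDD).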