
In pyspark, considering the two rdds like:

rdd1 = [('my name',5),('name is',4)]

and

rdd2 = [('my',6),('name',10),('is',5)]

where rdd1 is the tuples of bigrams and counts, rdd2 is the tuples of corresponding unigram and counts, I want to have an RDD of tuples of 3 elements like:

RDD = [ (('my name',5),('my',6),('name',10)) , (('name is',4), ('name',10),('is',5)) ]

I tried rdd2.union(rdd1).reduceByKey(lambda x,y: x+y), but that is not the right approach here: the keys in the two RDDs are never equal, they are only related (each bigram contains two unigrams).
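To see why the union approach cannot work, here is a plain-Python sketch (my own illustration, not from the question) of what union followed by reduceByKey does: it only combines values whose keys are identical, and 'my name' is never equal to 'my' or 'name'.

```python
from itertools import chain
from collections import defaultdict

rdd1 = [('my name', 5), ('name is', 4)]
rdd2 = [('my', 6), ('name', 10), ('is', 5)]

# Emulate union + reduceByKey(lambda x, y: x + y) locally:
# sum counts per *exactly equal* key.
merged = defaultdict(int)
for key, count in chain(rdd2, rdd1):
    merged[key] += count

print(dict(merged))
# {'my': 6, 'name': 10, 'is': 5, 'my name': 5, 'name is': 4}
```

No key collides, so nothing is combined and the bigrams stay disconnected from their unigrams.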

Are you using python or scala? You tagged python, but your code is scala? - Psidom
I'm using python, the examples are just to show an rdd in the form of a list of tuples. I don't know scala! - Elm662

1 Answer


You can do this: split the bigram RDD to generate a word key to join with rdd2, then group by the bigram to collect the matched unigram tuples back together:

(rdd1.flatMap(lambda x: [(w, x) for w in x[0].split()])      # key each bigram tuple by its words
     .join(rdd2.map(lambda x: (x[0], x)))                    # join with unigram tuples on the word
     .map(lambda x: x[1])                                    # drop the join key, keep (bigram, unigram) pairs
     .groupBy(lambda x: x[0])                                # group the pairs by their bigram tuple
     .map(lambda kv: (kv[0],) + tuple(v[1] for v in kv[1]))  # flatten to (bigram, unigram, unigram)
     .collect())

# [(('name is', 4), ('name', 10), ('is', 5)), (('my name', 5), ('name', 10), ('my', 6))]
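For readers without a Spark session handy, the same split-and-look-up logic can be sketched in plain Python (my own local equivalent, not part of the answer). Unlike the Spark groupBy, which gives no ordering guarantee within a group, this version keeps the unigrams in the order they appear in the bigram:

```python
rdd1 = [('my name', 5), ('name is', 4)]
rdd2 = [('my', 6), ('name', 10), ('is', 5)]

# Build a unigram -> count lookup, mirroring the join on the split key.
unigram_counts = dict(rdd2)

result = []
for bigram, count in rdd1:
    # Attach each unigram tuple of the bigram after the bigram tuple itself.
    result.append(((bigram, count),)
                  + tuple((w, unigram_counts[w]) for w in bigram.split()))

print(result)
# [(('my name', 5), ('my', 6), ('name', 10)), (('name is', 4), ('name', 10), ('is', 5))]
```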