
In pyspark, considering the two rdds like:

rdd1 = [('my name',5),('name is',4)]

and

rdd2 = [('my',6),('name',10),('is',5)]

where rdd1 is the tuples of bigrams and counts, rdd2 is the tuples of corresponding unigram and counts, I want to have an RDD of tuples of 3 elements like:

RDD = [ (('my name',5),('my',6),('name',10)) , (('name is',4), ('name',10),('is',5)) ]

I tried rdd2.union(rdd1).reduceByKey(lambda x,y: x+y), but that is not the right approach here: the keys in the two RDDs are never equal, they are only related (each bigram contains two unigrams).
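To see why the union approach cannot work, here is a plain-Python sketch (my own illustration, not from the question) of what union followed by reduceByKey does: it only combines values whose keys are identical, and 'my name' is never equal to 'my' or 'name'.

```python
from itertools import chain
from collections import defaultdict

rdd1 = [('my name', 5), ('name is', 4)]
rdd2 = [('my', 6), ('name', 10), ('is', 5)]

# Emulate union + reduceByKey(lambda x, y: x + y) locally:
# sum counts per *exactly equal* key.
merged = defaultdict(int)
for key, count in chain(rdd2, rdd1):
    merged[key] += count

print(dict(merged))
# {'my': 6, 'name': 10, 'is': 5, 'my name': 5, 'name is': 4}
```

No key collides, so nothing is combined and the bigrams stay disconnected from their unigrams.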

Are you using python or scala? You tagged python, but your code is scala? - Psidom
I'm using python, the examples are just to show an rdd in the form of a list of tuples. I don't know scala! - Elm662

1 Answer


You can do this: split the bigram RDD to generate a word key to join with rdd2, then group by the bigram to collect the matched unigram tuples back together:

(rdd1.flatMap(lambda x: [(w, x) for w in x[0].split()])      # key each bigram tuple by its words
     .join(rdd2.map(lambda x: (x[0], x)))                    # join with unigram tuples on the word
     .map(lambda x: x[1])                                    # drop the join key, keep (bigram, unigram) pairs
     .groupBy(lambda x: x[0])                                # group the pairs by their bigram tuple
     .map(lambda kv: (kv[0],) + tuple(v[1] for v in kv[1]))  # flatten to (bigram, unigram, unigram)
     .collect())

# [(('name is', 4), ('name', 10), ('is', 5)), (('my name', 5), ('name', 10), ('my', 6))]
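For readers without a Spark session handy, the same split-and-look-up logic can be sketched in plain Python (my own local equivalent, not part of the answer). Unlike the Spark groupBy, which gives no ordering guarantee within a group, this version keeps the unigrams in the order they appear in the bigram:

```python
rdd1 = [('my name', 5), ('name is', 4)]
rdd2 = [('my', 6), ('name', 10), ('is', 5)]

# Build a unigram -> count lookup, mirroring the join on the split key.
unigram_counts = dict(rdd2)

result = []
for bigram, count in rdd1:
    # Attach each unigram tuple of the bigram after the bigram tuple itself.
    result.append(((bigram, count),)
                  + tuple((w, unigram_counts[w]) for w in bigram.split()))

print(result)
# [(('my name', 5), ('my', 6), ('name', 10)), (('name is', 4), ('name', 10), ('is', 5))]
```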