
I have an RDD of the form (Group, [word1, word2, ..., wordn]): each record pairs a group with the words that belong to that group. Given the input below

rdd = (g1, [w1, w2, w4]), (g2, [w3, w2]), (g3, [w4, w1]), (g3, [w1, w2, w3]), (g2, [w2])

I want to collect output showing how many times each word occurs in each group. The output format would be:

Word  Group1  Group2  Group3
w1    1       0       2
w2    1       2       1
w3    0       1       1
w4    1       0       1

What PySpark functions can I use to achieve this output in the most efficient way?

What you're asking for can probably be done by converting to a DataFrame and calling explode() and pivot(). You're more likely to get responses if you provide a minimal reproducible example so that people can easily recreate your data. See more here: how to create good reproducible apache spark dataframe examples. – pault
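
For reference, here is a minimal sketch of that explode()/pivot() approach, assuming the data is recreated as a DataFrame (the column names group and words are illustrative; the pivoted columns are named after the group values, e.g. g1, g2, g3, rather than Group1 etc.):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the question's sample data as a DataFrame
data = [("g1", ["w1", "w2", "w4"]),
        ("g2", ["w3", "w2"]),
        ("g3", ["w4", "w1"]),
        ("g3", ["w1", "w2", "w3"]),
        ("g2", ["w2"])]
df = spark.createDataFrame(data, ["group", "words"])

result = (df.select("group", F.explode("words").alias("word"))  # one row per (group, word) pair
            .groupBy("word")
            .pivot("group")   # one column per distinct group
            .count()          # how often each word occurs in each group
            .na.fill(0)       # absent combinations become 0
            .orderBy("word"))
result.show()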

1 Answer


You should use reduceByKey on your RDD to combine the arrays that share a group key:

def combineArrays(x, y):
    # concatenate the word lists of two records with the same group key
    return x + y

rdd = rdd.reduceByKey(combineArrays)
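
With the sample input from the question, the combined RDD would hold (the order of elements within each list may vary with partitioning):

('g1', ['w1', 'w2', 'w4'])
('g2', ['w3', 'w2', 'w2'])
('g3', ['w4', 'w1', 'w1', 'w2', 'w3'])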

Then use the Counter class from the collections module to count the occurrences of each element in the combined arrays:

from collections import Counter

# Count how many times each word appears within each group
rdd = rdd.mapValues(Counter)

After this, you should see output like:

('g3', Counter({'w1': 2, 'w4': 1, 'w3': 1, 'w2': 1}))
('g1', Counter({'w4': 1, 'w2': 1, 'w1': 1}))
('g2', Counter({'w2': 2, 'w3': 1}))
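
If you also need the Word × Group table from the question, one way (a sketch, assuming an active SparkSession so that toDF() is available; column names are illustrative) is to flatten the Counters into (word, group, count) rows and pivot with the DataFrame API:

# Flatten each (group, Counter) pair into (word, group, count) rows
rows = rdd.flatMap(lambda kv: [(word, kv[0], n) for word, n in kv[1].items()])

pivoted = (rows.toDF(["word", "group", "count"])
               .groupBy("word")
               .pivot("group")    # one column per group
               .sum("count")      # the counts precomputed above
               .na.fill(0)
               .orderBy("word"))
pivoted.show()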

I hope the answer is helpful.