
I have an RDD of the form (Group, [word1, word2, ..., wordn]): each record pairs a group with the words that belong to that group. Given the input below

rdd = (g1, [w1, w2, w4]), (g2, [w3, w2]), (g3, [w4, w1]), (g3, [w1, w2, w3]), (g2, [w2])

I want to collect output showing how many times each word occurs in each group. The output format would be:

Word  Group1  Group2  Group3
w1    1       0       2
w2    1       2       1
w3    0       1       1
w4    1       0       1

What PySpark functions can I use to achieve this output in the most efficient way?

What you're asking for can probably be done by converting to a DataFrame and calling explode() and pivot(). You're more likely to get responses if you provide a minimal reproducible example so that people can easily recreate your data. See more here: how to create good reproducible apache spark dataframe examples. – pault
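
For reference, here is a minimal sketch of that explode()/pivot() approach, assuming the data is recreated as a DataFrame (the column names group and words are illustrative; the pivoted columns are named after the group values, e.g. g1, g2, g3, rather than Group1 etc.):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the question's sample data as a DataFrame
data = [("g1", ["w1", "w2", "w4"]),
        ("g2", ["w3", "w2"]),
        ("g3", ["w4", "w1"]),
        ("g3", ["w1", "w2", "w3"]),
        ("g2", ["w2"])]
df = spark.createDataFrame(data, ["group", "words"])

result = (df.select("group", F.explode("words").alias("word"))  # one row per (group, word) pair
            .groupBy("word")
            .pivot("group")   # one column per distinct group
            .count()          # how often each word occurs in each group
            .na.fill(0)       # absent combinations become 0
            .orderBy("word"))
result.show()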

1 Answer


You should use reduceByKey on your RDD to combine the arrays that share a group key:

def combineArrays(x, y):
    # concatenate the word lists of two records with the same group key
    return x + y

rdd = rdd.reduceByKey(combineArrays)
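
With the sample input from the question, the combined RDD would hold (the order of elements within each list may vary with partitioning):

('g1', ['w1', 'w2', 'w4'])
('g2', ['w3', 'w2', 'w2'])
('g3', ['w4', 'w1', 'w1', 'w2', 'w3'])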

Then use the Counter class from the collections module to count the occurrences of each element in the combined arrays:

from collections import Counter

# Count how many times each word appears within each group
rdd = rdd.mapValues(Counter)

After this, you should see output like:

('g3', Counter({'w1': 2, 'w4': 1, 'w3': 1, 'w2': 1}))
('g1', Counter({'w4': 1, 'w2': 1, 'w1': 1}))
('g2', Counter({'w2': 2, 'w3': 1}))
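
If you also need the Word × Group table from the question, one way (a sketch, assuming an active SparkSession so that toDF() is available; column names are illustrative) is to flatten the Counters into (word, group, count) rows and pivot with the DataFrame API:

# Flatten each (group, Counter) pair into (word, group, count) rows
rows = rdd.flatMap(lambda kv: [(word, kv[0], n) for word, n in kv[1].items()])

pivoted = (rows.toDF(["word", "group", "count"])
               .groupBy("word")
               .pivot("group")    # one column per group
               .sum("count")      # the counts precomputed above
               .na.fill(0)
               .orderBy("word"))
pivoted.show()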

I hope the answer is helpful.