I have an RDD in the form (Group, [word1, word2, ..., wordn]). It contains a group and the words that belong to that group. Given the input below:
rdd = (g1,[w1,w2,w4]), (g2,[w3,w2]), (g3,[w4,w1]), (g3,[w1,w2,w3]), (g2,[w2])
I want to collect output showing how many times each word occurs in each group. The output format would be:
Word  g1  g2  g3
w1     1   0   2
w2     1   2   1
w3     0   1   1
w4     1   0   1
What PySpark functions can I use to achieve this output in the most efficient way?
explode() and pivot()
You're more likely to get responses if you provide a minimal reproducible example so that people can easily recreate your data. See more here: How to create good reproducible Apache Spark DataFrame examples. – pault