How to calculate the count of words per line in pyspark

Question

I tried this:

rdd1 = sc.parallelize(["Let's have some fun.",
  "To have fun you don't need any plans."])
output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: (lists, len(lists)))
output.foreach(print)

output:

(["Let's", 'have', 'some', 'fun.'], 4)
(['To', 'have', 'fun', 'you', "don't", 'need', 'any', 'plans.'], 8)

This gives me the total number of words per line, but I want the count of each word per line.

pissall · Accepted Answer · 2020-03-11T08:17:55

You can try this:

from collections import Counter 

output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: dict(Counter(lists)))

I'll give a small python example:

from collections import Counter

example_1 = "Let's have some fun."
Counter(example_1.split(" "))
# Counter({"Let's": 1, 'have': 1, 'some': 1, 'fun.': 1})

example_2 = "To have fun you don't need any plans."
Counter(example_2.split(" "))
# Counter({'To': 1, 'have': 1, 'fun': 1, 'you': 1, "don't": 1, 'need': 1, 'any': 1, 'plans.': 1})
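Every word in the two sample sentences occurs only once, so the counts above are all 1. Here is a minimal sketch with a repeated word to make the per-line counting visible. The helper name `word_counts` and the third sample line are made up for illustration, and it uses `split()` with no argument (instead of `split(" ")`) on the assumption that you don't want empty strings when a line contains consecutive spaces:

```python
from collections import Counter


def word_counts(line):
    """Return a plain dict mapping each word in `line` to its count."""
    # split() with no argument splits on any run of whitespace,
    # so repeated spaces do not produce empty-string "words".
    return dict(Counter(line.split()))


lines = [
    "Let's have some fun.",
    "To have fun you don't need any plans.",
    "fun fun  have",  # hypothetical extra line with a repeated word
]

for line in lines:
    print(word_counts(line))
```

In Spark, the same function should be usable as `rdd1.map(word_counts)`, which would give one dict per line. Punctuation stays attached to the words ('fun.' and 'fun' are counted separately), same as in the code above.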
