2
votes

I tried this:

rdd1 = sc.parallelize(["Let's have some fun.",
  "To have fun you don't need any plans."])
output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: (lists, len(lists)))
output.foreach(print)

output:

(["Let's", 'have', 'some', 'fun.'], 4)
(['To', 'have', 'fun', 'you', "don't", 'need', 'any', 'plans.'], 8)

This gives me the total number of words per line, but I want the count of each word per line.

2
Do you want the number of words and their occurrences? - whatsinthename

2 Answers

3
votes

You can try this:

from collections import Counter 

output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: dict(Counter(lists)))

Here is a small plain-Python example of what Counter returns for each line:

from collections import Counter

example_1 = "Let's have some fun."
Counter(example_1.split(" "))
# Counter({"Let's": 1, 'have': 1, 'some': 1, 'fun.': 1})

example_2 = "To have fun you don't need any plans."
Counter(example_2.split(" "))
# Counter({'To': 1, 'have': 1, 'fun': 1, 'you': 1, "don't": 1, 'need': 1, 'any': 1, 'plans.': 1})
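
The per-line idea above can also be written as a named function, which is easier to test locally before handing it to Spark. This is only a minimal sketch: `count_words` is a hypothetical helper name, and the list comprehension is a plain-Python stand-in for `rdd1.map(count_words).collect()`, so no Spark context is needed to run it.

```python
from collections import Counter


def count_words(line):
    """Return a dict mapping each space-separated word in `line` to its count."""
    return dict(Counter(line.split(" ")))


lines = ["Let's have some fun.",
         "To have fun you don't need any plans."]

# Local stand-in for rdd1.map(count_words).collect(): one dict per input line.
per_line = [count_words(line) for line in lines]

for counts in per_line:
    print(counts)
```

Because the words are split on a single space, punctuation stays attached, so 'fun' and 'fun.' are counted as different words.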
1
votes

Based on your input, and as far as I understand the question, the code below should work. It needs only minor changes to your code:

output = rdd1.flatMap(lambda t: t.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)

You used map to split the data; use flatMap instead. It flattens the per-line lists so that the RDD holds individual words, and reduceByKey then sums the counts for each word across all lines. Output:

output.collect()

[('have', 2), ("Let's", 1), ('To', 1), ('you', 1), ('need', 1), ('fun', 1), ("don't", 1), ('any', 1), ('some', 1), ('fun.', 1), ('plans.', 1)]
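
To see the difference between map and flatMap without a Spark context, here is a rough plain-Python sketch. The names `mapped`, `flat_mapped` and `totals` are made up for illustration, and `itertools.chain` and `Counter` are only local stand-ins for what flatMap and reduceByKey do on an RDD.

```python
from collections import Counter
from itertools import chain

lines = ["Let's have some fun.",
         "To have fun you don't need any plans."]

# map-like: one output element per input line, so the result is a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

# Pairing each word with 1 and summing per key is what
# map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y) amounts to.
totals = Counter(flat_mapped)

print(len(mapped), len(flat_mapped))
print(totals["have"])
```

Since the flattened sequence no longer remembers which line a word came from, the totals are per word over all lines, which is why 'have' ends up with a count of 2.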