
I need to count the number of occurrences of elements in an RDD. This would be easy if I just had, say, letter counts in the RDD like this:

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 3)])
rdd.reduceByKey(lambda a, b: a + b).collect()  # [('a', 4), ('b', 1)]

But each element of my data comes from a tweet, so each element will usually contain counts for several letters, like this:

rdd2 = sc.parallelize([[("a", 2), ("b", 1), ("c", 3)], [("a", 5), ("b", 2)]])

What is an efficient way of combining this into a distributed dataset of key/value tuples, with key = letter and val = total number of occurrences?

Solutions I've considered:

  • First convert each element to just a list of letters, reduce with lambda a, b: a + b, then build a Counter from the result. This works, but a lot of data is sent to the driver node and the Counter is constructed locally there.
  • Convert each element to a dict like {"a": 2, "b": 1}, write a method to combine dicts, and reduce using that (see the sketch below for roughly what I mean). I'm a bit worried about this because a) dicts are usually passed by reference in Python and I'm not convinced I fully understand what behavior I'll get if I simply add the items from dict a to dict b in the combiner method, and b) I can get around that by creating a new dict in the combiner method, but that means repeatedly creating very large dictionaries while reducing.
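
For reference, here's a minimal sketch of what I mean by the second option, using the rdd2 above (combine_counts is just a name I made up):

def combine_counts(a, b):
    # Copy a rather than mutate it, since I'm not sure mutating the
    # arguments of a reduce function is safe; this copy is the
    # "creating a new dict every merge" cost I mentioned above.
    merged = dict(a)
    for letter, count in b.items():
        merged[letter] = merged.get(letter, 0) + count
    return merged

rdd2.map(dict).reduce(combine_counts)  # {'a': 7, 'b': 3, 'c': 3}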

Any help would be greatly appreciated.


1 Answer


Just flatMap and reduceByKey:

rdd2.flatMap(lambda x: x).reduceByKey(lambda x, y: x + y)

which collected would give:

[('b', 3), ('c', 3), ('a', 7)]
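
Equivalently, you can pass operator.add instead of the lambda (the order of the collected tuples may vary with partitioning):

from operator import add

rdd2.flatMap(lambda x: x).reduceByKey(add).collect()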