I need to count the number of occurrences of elements in an RDD. This would be easy if I just had, say, letter counts in the RDD like this:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 3)])
rdd.reduceByKey(lambda a, b: a + b).collect()  # prints [('a', 4), ('b', 1)]
But each element of my data comes from a tweet, so a single element will often contain counts for several letters at once, like this:
rdd2 = sc.parallelize([[("a", 2), ("b", 1), ("c", 3)], [("a", 5), ("b", 2)]])
What is an efficient way of combining this into a distributed dataset of key/value tuples, where each key is a letter and each value is that letter's total number of occurrences?
Solutions I've considered:
- First convert each element to just a flat list of letters, then reduce with lambda a, b: a + b, then build a Counter. This works, but a lot of data gets sent to the driver node and the Counter is constructed locally there (first sketch below).
- Convert each element to a dict like {"a": 2, "b": 1}, write a method to combine dicts, and reduce using that (second sketch below). I'm a bit worried about this because (a) dicts are passed by reference in Python and I'm not convinced I fully understand what behavior I'll get if I simply add the items from dict a to dict b inside the combiner method, and (b) I can get around that by creating a new dict in the combiner method, but that means repeatedly allocating very large dictionaries during the reduce.
Any help would be greatly appreciated.