I'm using Spark 1.5.2 and Python 2.7 on Ubuntu.
According to the documentation for countByValue and countByValueAndWindow (under "Transformations on DStreams" and "Window Operations"):
countByValue: When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
So basically the return value of these two functions should be a DStream whose batches contain (K, Long) pairs, right?
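To make my reading of the docs concrete, here is a tiny plain-Python sketch (no Spark involved, toy data of my own) of what I expected each batch to turn into, namely one (value, count) pair per distinct element:

    from collections import Counter

    def expected_count_by_value(batch):
        # Documented semantics: one (value, count) pair per distinct element.
        return sorted(Counter(batch).items())

    print(expected_count_by_value(["a", "a", "b"]))       # [('a', 2), ('b', 1)]
    print(expected_count_by_value(["b", "c", "c", "d"]))  # [('b', 1), ('c', 2), ('d', 1)]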
However, when I was doing some experiments, the return value turned out to be a list of integers, not pairs!
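For concreteness, here's a minimal local reproduction sketch of what I'm doing (the batch contents and app name are made up; assumes Spark 1.5.2 with Python 2.7):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "countByValueRepro")
    ssc = StreamingContext(sc, 1)  # 1-second batches

    # Two hand-made batches fed through a queue stream.
    batches = [sc.parallelize(["a", "a", "b"]),
               sc.parallelize(["b", "c", "c", "d"])]
    stream = ssc.queueStream(batches)

    # Per the docs I expected pairs like ('a', 2), ('b', 1) for the first batch.
    # On 1.5.x this prints a single integer per batch instead
    # (the number of distinct values): 2, then 3.
    stream.countByValue().pprint()

    ssc.start()
    ssc.awaitTermination(5)
    ssc.stop()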
What's more, in the official PySpark test code on GitHub (Link1, Link2), the "expected results" are lists of integers! It looks like the function is counting the number of distinct elements in each batch and returning only that total.
I thought I had misunderstood the documentation until I saw the Scala test code on GitHub (Link1, Link2). The test cases are similar, but this time the results are a sequence of pairs!
In summary: the documentation and the Scala test cases say the result is (K, Long) pairs, but the Python test cases and my own experiments show the result is plain integers.
I'm new to PySpark and Spark Streaming. Could someone explain this inconsistency? For now I'm using reduceByKey and reduceByKeyAndWindow as a workaround (sketched below).
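Here's a rough sketch of that workaround, building on the `stream` DStream from the reproduction above (the window and slide durations are just illustrative):

    from operator import add

    # Turn each element into a (value, 1) pair and sum the ones,
    # which yields the (K, Long) pairs the documentation describes.
    pairs = stream.map(lambda v: (v, 1))

    # Stand-in for countByValue: one (value, count) pair per distinct value per batch.
    pairs.reduceByKey(add).pprint()

    # Stand-in for countByValueAndWindow: counts over a 3-second window sliding every second.
    # (Passing None as the inverse function avoids the checkpointing requirement.)
    pairs.reduceByKeyAndWindow(add, None, 3, 1).pprint()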
References:
UPDATE
This bug is scheduled to be fixed in PySpark 2.0.0.