I have an RDD that I've used to load binary files. Each file is broken into multiple parts and processed. After the processing step, each entry is:
(filename, List[Results])
Since the files are broken into several parts, the filename is the same for several entries in the RDD. I'm trying to put the results for each part back together using reduceByKey. However, when I run a count on the reduced RDD, it returns 0:
// Concatenate the partial result lists that share the same filename
val reducedResults = my_rdd.reduceByKey((resultsA, resultsB) => resultsA ++ resultsB)
reducedResults.count() // 0
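For reference, here's the behaviour I expect, reproduced with toy data (String stands in for my real result type, and sc is an existing SparkContext):

val parts = Seq(
  ("fileA", List("a1")), ("fileA", List("a2")),  // two parts of the same file
  ("fileB", List("b1"))
)
val toy = sc.parallelize(parts)
toy.reduceByKey(_ ++ _).count()  // 2, one entry per filename, as expected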
I've tried changing the key it uses, with no success. Even an extremely simple attempt that groups everything under a single key produces an empty result:
val singleGroup = my_rdd.groupBy { case (k, v) => 1 }
singleGroup.count() // 0
On the other hand, if I simply collect the results, then I can group them outside of Spark and everything works fine. However, I still have additional processing to do on the combined results, so collecting everything to the driver isn't a good option.
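For completeness, the driver-side version that works looks like this (plain Scala collections on the collected array, not Spark):

val groupedLocally = my_rdd
  .collect()                                    // Array[(filename, List[Results])]
  .groupBy { case (filename, _) => filename }
  .mapValues(_.flatMap { case (_, results) => results }.toList)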
What could cause the groupBy/reduceByKey commands to return an empty RDD when the initial RDD isn't empty?
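For reference, a quick way to confirm the input itself isn't empty (the outputs noted below are illustrative, not my actual values):

println(my_rdd.count())           // non-zero for me
my_rdd.take(2).foreach(println)   // prints sample (filename, results) pairs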
my_rdd is empty - please convince me otherwise by providing a sample of your data, so we can try to reproduce your problem. – Glennie Helles Sindholt