I have a pair RDD with 80,000 entries, of which only about 1.5% are unique. To filter out the duplicates I call the distinct method:
val newRDD = oldRDD.distinct
However, this removes only most of the duplicates: between 3 and 5 copies of each unique entry remain.
I checked the remaining entries against the original entries, and they are exactly the same.
Sample of the original data:
(1,(0.0500937328554143, 0.9000767961093774))
(1,(0.0500937328554143, 0.9000767961093774))
(1,(0.0500937328554143, 0.9000767961093774))
Sample of the data after calling distinct:
(1,(0.0500937328554143, 0.9000767961093774))
(1,(0.0500937328554143, 0.9000767961093774))
(1,(0.0500937328554143, 0.9000767961093774))
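For comparison, here is a minimal sketch of the behaviour I expected, using plain Scala collections instead of an RDD so that it runs without Spark. It assumes the entries are tuples of an Int and a pair of Doubles, which is what the printed output suggests but may not match the real element type. The names ExpectedDistinct, sample and deduped are hypothetical and exist only for this illustration.

```scala
// Plain-Scala sketch (no Spark involved).
// Assumption: each entry is (Int, (Double, Double)), as the printed samples suggest.
object ExpectedDistinct {
  // The sample row from above, repeated three times.
  val sample: Seq[(Int, (Double, Double))] =
    Seq.fill(3)((1, (0.0500937328554143, 0.9000767961093774)))

  // Tuples compare by value via equals/hashCode, so identical rows collapse.
  val deduped: Seq[(Int, (Double, Double))] = sample.distinct

  def main(args: Array[String]): Unit =
    println(s"${sample.size} rows before distinct, ${deduped.size} after")
}
```

With value-based equality, the three identical rows collapse to one here, which is what I expected from the RDD as well.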
Is there something that I am missing about how distinct works?