Apache Spark RDD distinct - strange behaviour

Question

I have a pairRDD of size 80'000. Only 1.5% of the entries are unique. To filter out the replicated data I call the distinct method:

val newRDD = oldRDD.distinct

However, this removes only most of the duplicates: between 3 and 5 copies of each unique entry remain!

I checked the remaining entries against the original entries and they are exactly the same.

Sample of the original data:

(1,(0.0500937328554143, 0.9000767961093774))
(1,(0.0500937328554143, 0.9000767961093774))
(1,(0.0500937328554143, 0.9000767961093774))

Sample of the distinct data:

(1,(0.0500937328554143, 0.9000767961093774))
(1,(0.0500937328554143, 0.9000767961093774))
(1,(0.0500937328554143, 0.9000767961093774))

Is there something that I am missing about how distinct works?

Daniel Darabos · Accepted Answer · 2015-04-06T18:33:27

These numbers compare equal after being converted to strings, but judging by how distinct treats them, they must not compare equal before the conversion. Instead of printing them, check the result of comparing them with ==.
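
As a rough illustration of that check, here is a minimal sketch in plain Scala (no Spark needed). It assumes one possible cause, which the question does not confirm: the value part of the pair is an instance of a class that defines toString but not equals/hashCode. The names Point, CasePoint and DistinctCheck are hypothetical and not taken from the question. Spark's distinct, like distinct on Scala collections, relies on equals and hashCode, so the same kind of comparison applies to elements collected from the RDD.

```scala
// Hypothetical value class: it overrides toString but not equals/hashCode,
// so two instances with the same fields print identically yet are not ==.
class Point(val x: Double, val y: Double) {
  override def toString: String = s"($x, $y)"
}

// Same fields as a case class, which gets structural equals/hashCode.
case class CasePoint(x: Double, y: Double)

object DistinctCheck {
  val a = (1, new Point(0.05, 0.9))
  val b = (1, new Point(0.05, 0.9))

  // Printing hides the difference: both strings are identical.
  val sameString: Boolean = a.toString == b.toString

  // Comparing with == reveals it: Point falls back to reference equality.
  val sameValue: Boolean = a == b

  // distinct follows ==, so both "duplicates" survive.
  val plainDistinct: Int = Seq(a, b).distinct.size

  // With structural equality the duplicate is removed.
  val caseDistinct: Int =
    Seq((1, CasePoint(0.05, 0.9)), (1, CasePoint(0.05, 0.9))).distinct.size

  def main(args: Array[String]): Unit = {
    println(s"same string: $sameString, same value: $sameValue")
    println(s"distinct sizes: plain=$plainDistinct, case class=$caseDistinct")
  }
}
```

If two entries that print the same turn out not to be ==, the fix is to give the element type proper equals and hashCode (for example by using a case class or a tuple of primitives) before calling distinct.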
