1
votes

I am using Spark 1.2 with Scala and have a pair RDD of (String, String). Sample records look like:

<Key,  value>
id_1,  val_1_1; val_1_2
id_2,  val_2_1; val_2_2
id_3,  val_3_1; val_3_2
id_1,  val_4_1; val_4_2

I just want to remove all records with a duplicate key, so in the above example the fourth record will be removed because id_1 is a duplicate key.

Please help.

Thanks.

Where there are duplicate keys, how will you decide which value to keep? – mattinbits
It's just the first value that I need. – user2200660
The problem is that when Spark does a reduceByKey, as suggested in the answer below, you have no way to know which value will be picked. There's no guarantee that Spark maintains the ordering of the rows. Is there something about the value (such as the fact it is _1_1) that you can use to differentiate? – mattinbits
@mattinbits, zipWithIndex first, then in the reduce just keep the one with the lowest index, then map afterwards to remove the index. Viola! No, wait, that's a large violin. Voila! – The Archetypal Paul

2 Answers

11
votes

You can use reduceByKey:

val rdd: RDD[(K, V)] = // ...
// keeps one value per key; which duplicate survives is not guaranteed,
// since reduceByKey gives no ordering guarantee across partitions
val res: RDD[(K, V)] = rdd.reduceByKey((v1, v2) => v1)
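
For example, applied to the sample data from the question (a minimal sketch, assuming a SparkContext named sc is already available, e.g. in the Spark shell):

// sample pair RDD matching the question's data
val pairs = sc.parallelize(Seq(
  ("id_1", "val_1_1; val_1_2"),
  ("id_2", "val_2_1; val_2_2"),
  ("id_3", "val_3_1; val_3_2"),
  ("id_1", "val_4_1; val_4_2")))

// one record per key; which of the duplicate values is kept is arbitrary
val deduped = pairs.reduceByKey((v1, v2) => v1)
deduped.collect().foreach(println)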
1
votes

If it is necessary to always select the first entry for a given key, then, combining @JeanLogeart's answer with the comment from @Paul:

import org.apache.spark.{SparkContext, SparkConf}

val data = List(
  ("id_1", "val_1_1; val_1_2"),
  ("id_2", "val_2_1; val_2_2"),
  ("id_3", "val_3_1; val_3_2"),
  ("id_1", "val_4_1; val_4_2"))

val conf = new SparkConf().setMaster("local").setAppName("App")
val sc = new SparkContext(conf)
val dataRDD = sc.parallelize(data)

// tag each record with its original position, keep the entry with the
// lowest index for each key, then drop the index again
val resultRDD = dataRDD.zipWithIndex.map {
  case ((key, value), index) => (key, (value, index))
}.reduceByKey((v1, v2) => if (v1._2 < v2._2) v1 else v2).mapValues(_._1)

resultRDD.collect().foreach(v => println(v))
sc.stop()
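
On the sample data this should keep the first occurrence for id_1, printing (id_1,val_1_1; val_1_2), (id_2,val_2_1; val_2_2) and (id_3,val_3_1; val_3_2), though the order of the collected output may vary.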