I am using Spark 1.2 with Scala and have a pair RDD of (String, String). Sample records look like:
(key, value)
id_1, val_1_1; val_1_2
id_2, val_2_1; val_2_2
id_3, val_3_1; val_3_2
id_1, val_4_1; val_4_2
I just want to remove all records with a duplicate key, so in the above example the fourth record will be removed because id_1 is a duplicate key.
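To make the intent concrete, here is a minimal, self-contained sketch (for spark-shell or a standalone app); the SparkContext setup, app name, and sample values are illustrative only, and the reduceByKey call is just one approach I can think of, which keeps an arbitrary one of the duplicate values:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.2

object DedupByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedup-by-key").setMaster("local[*]"))

    // Sample input, matching the records shown above.
    val pairs = sc.parallelize(Seq(
      ("id_1", "val_1_1; val_1_2"),
      ("id_2", "val_2_1; val_2_2"),
      ("id_3", "val_3_1; val_3_2"),
      ("id_1", "val_4_1; val_4_2")
    ))

    // Keep a single value per key; which of the duplicate values survives
    // is arbitrary here, since (a, b) => a depends on how the data is combined.
    val deduped = pairs.reduceByKey((a, b) => a)

    deduped.collect().foreach(println)
    sc.stop()
  }
}
```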
Please help.
Thanks.
With reduceByKey, as suggested in the answer below, you have no way to know which value will be picked. There's no guarantee that Spark maintains the ordering of the rows. Is there something about the value (such as the fact it is _1_1) that you can use to differentiate? – mattinbits
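For instance, a sketch of a deterministic variant along those lines, assuming that keeping the lexicographically smallest value per key is acceptable (the helper name is hypothetical, and any total ordering on the values would work):

```scala
import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.2
import org.apache.spark.rdd.RDD

// Hypothetical helper: pick a deterministic representative per key
// (here the lexicographically smallest value) instead of an arbitrary one.
def dedupDeterministic(pairs: RDD[(String, String)]): RDD[(String, String)] =
  pairs.reduceByKey((a, b) => if (a <= b) a else b)
```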