I have a Spark RDD that looks like this:
[(1, ...),
(1, ...),
(2, ...),
(3, ...)]
I am trying to remove every record whose key appears more than once; in this case I want to exclude all the records with key 1. The final output should look like:
[(2, ...),
(3, ...)]
Here is what I have tried so far. It works, but my gut says there should be a better solution:
>> a = sc.parallelize([(1, [1, 1]), (1, [1, 1]), (2, [1, 1]), (3, [1, 1])])
>> print a.groupByKey() \
       .filter(lambda x: len(x[1]) == 1) \
       .map(lambda x: (x[0], list(x[1])[0])).collect()
[(2, [1, 1]), (3, [1, 1])]
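For reference, here is one alternative I sketched out (same a as above; the names counts, unique_keys, and result are just placeholders I made up, and I have not tested this at scale). It counts occurrences per key with reduceByKey, keeps only the keys that appear exactly once, and joins those keys back against the original RDD:

# Count how many times each key appears.
counts = a.map(lambda kv: (kv[0], 1)).reduceByKey(lambda x, y: x + y)

# Keep only the keys that occur exactly once.
unique_keys = counts.filter(lambda kv: kv[1] == 1)

# Join back to the original RDD and drop the count from the joined value.
result = a.join(unique_keys).map(lambda kv: (kv[0], kv[1][0]))
print result.collect()
# [(2, [1, 1]), (3, [1, 1])]

This avoids materializing the grouped values the way groupByKey does, but it still makes an extra pass plus a join, so I am not sure it is actually better.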
Can anyone suggest a cleaner approach?