I have a key-value pair RDD. The RDD contains some elements with duplicate keys, and I want to split original RDD into two RDDs: One stores elements with unique keys, and another stores the rest elements. For example,
Input RDD (6 elements in total):
<k1,v1>, <k1,v2>, <k1,v3>, <k2,v4>, <k2,v5>, <k3,v6>
Result:
Unique keys RDD (store elements with unique keys; For the multiple elements with the same key, any element is accepted):
<k1,v1>, <k2, v4>, <k3,v6>
Duplicated keys RDD (store the rest elements with duplicated keys):
<k1,v2>, <k1,v3>, <k2,v5>
In the above example, unique RDD has 3 elements, and the duplicated RDD has 3 elements too.
I tried groupByKey() to group elements with the same key together. For each key, there is a sequence of elements. However, the performance of groupByKey() is not good because the data size of element value is very big which causes very large data size of shuffle write.
So I was wondering if there is any better solution. Or is there a way to reduce the amount of data being shuffled when using groupByKey()?