How to get the difference between two RDDs in PySpark?

Question

I'm trying to set up a cohort study to track in-app user behavior, and I'd like to ask how I can exclude from rdd2 any element that is also in rdd1. Given:

rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])

rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])

For example, to get the elements common to rdd1 and rdd2, we just have to do:

rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][1])).collect()

Which gives:

[('a', (2, '6play'))]

So this join finds the elements common to rdd1 and rdd2 and takes the keys and values from rdd2 only. I want to do the opposite: find the elements that are in rdd2 but not in rdd1, again taking the keys and values from rdd2 only. In other words, I want to get the items from rdd2 whose keys aren't present in rdd1. So the expected output is:

("c", "bobo")

Any ideas? Thank you :)

Arij SEDIRI · Accepted Answer · 2016-11-17T14:32:22

11 votes

I just found the answer, and it's very simple!

rdd2.subtractByKey(rdd1).collect()
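To make the semantics explicit: subtractByKey compares keys only, so the values in rdd1 play no part in the result. Here is a rough plain-Python sketch of that behavior on ordinary lists, with no Spark involved. The helper name subtract_by_key and the variables pairs1 and pairs2 are made up for illustration and are not part of the PySpark API.

```python
# Plain-Python sketch of what subtractByKey does, using lists instead of RDDs.
# subtract_by_key is a hypothetical helper, not a PySpark function.
def subtract_by_key(left, right):
    """Return the (key, value) pairs of `left` whose key does not occur in `right`."""
    right_keys = {key for key, _ in right}  # only keys matter; values of `right` are ignored
    return [(key, value) for key, value in left if key not in right_keys]


# Same data as in the question, as ordinary lists of pairs.
pairs1 = [("a", "xoxo"), ("b", 4)]
pairs2 = [("a", (2, "6play")), ("c", "bobo")]

result = subtract_by_key(pairs2, pairs1)
print(result)  # [('c', 'bobo')]
```

Note that the order of the arguments matters: swapping them would instead return the pairs from the first list whose keys are missing from the second.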

Enjoy :)
