I'm trying to set up a cohort study to track in-app user behavior, and I'd like to ask how I can exclude from rdd2 the elements that are also present in rdd1. Given:
rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])
rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])
For example, to get the elements common to rdd1 and rdd2, we just have to do:
rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][1])).collect()
Which gives:
[('a', (2, '6play'))]
So this join finds the elements common to rdd1 and rdd2 (matched by key) and keeps the key and values from rdd2 only. I want to do the opposite: find the elements which are in rdd2 but not in rdd1, again keeping the key and values from rdd2 only. In other words, I want the items of rdd2 whose key isn't present in rdd1. So the expected output is:
("c", "bobo")
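To make the requirement concrete, here is a minimal plain-Python sketch of the semantics I'm after. It involves no Spark at all; pairs1, pairs2, keys_in_1 and only_in_2 are names made up for illustration, and the two lists simply mirror the contents of rdd1 and rdd2 above.

```python
# Plain-Python sketch of the desired result (no Spark involved).
# pairs1 / pairs2 are illustrative stand-ins for the contents of rdd1 / rdd2.
pairs1 = [("a", "xoxo"), ("b", 4)]
pairs2 = [("a", (2, "6play")), ("c", "bobo")]

# Collect the keys that appear in the first collection.
keys_in_1 = {key for key, _ in pairs1}

# Keep only the pairs of the second collection whose key is absent from the first.
only_in_2 = [(key, value) for key, value in pairs2 if key not in keys_in_1]

print(only_in_2)  # [('c', 'bobo')]
```

What I'm looking for is the distributed equivalent of this on RDDs, ideally without collecting the keys of rdd1 on the driver.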
Any ideas? Thank you :)