What is the opposite of Union in Pyspark

Question

This seems like such a trivial question but I cannot find an answer anywhere!

I have two RDDs, one with a vectorized article and another with a bunch of stopwords. My first instinct was to use the filter function but apparently you can't have two RDDs interact in that way. I know Union allows RDDs to interact but I need the exact opposite of that so I can filter out all of the stopwords in my first RDD.

Any help would be much appreciated.

EDIT:

RDD1_filtered = RDD1.filter(lambda word: word not in RDD2)

Both RDDs are lists of words. I get an error saying that I cannot have two RDDs interacting.

Can you show code? You seem to have a tuple in your RDD, so why can't you filter it? — OneCricketeer

santon · Accepted Answer · 2017-02-22T00:26:47

It sounds like you want the subtract function:

>>> left = sc.parallelize(range(10))
>>> right = sc.parallelize([2, 6])
>>> left.subtract(right).collect()
[0, 1, 3, 4, 5, 7, 8, 9]
