0
votes

This seems like such a trivial question but I cannot find an answer anywhere!

I have two RDDs: one holding a vectorized article and another holding a list of stopwords. My first instinct was to use the filter function, but apparently you can't have two RDDs interact in that way. I know union lets two RDDs interact, but I need the exact opposite: I want to filter all of the stopwords out of my first RDD.

Any help would be much appreciated.

EDIT:

RDD1_filtered = RDD1.filter(lambda word: word not in RDD2)

Both RDDs are lists of words. I get an error saying I cannot have two RDDs interacting.

2
Can you show code? You seem to have a tuple in your RDD, so why can't you filter it? - OneCricketeer
I added the command I am attempting to use to filter. - madsthaks

2 Answers

4
votes

It sounds like you want the subtract function:

>>> left = sc.parallelize(range(10))
>>> right = sc.parallelize([2, 6])
>>> sorted(left.subtract(right).collect())  # subtract shuffles, so sort for a stable order
[0, 1, 3, 4, 5, 7, 8, 9]
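
If the stopword list is small, the filter approach from the question can also be made to work. The error comes from referencing RDD2 inside the lambda that runs on RDD1; collecting the stopwords into a plain Python set first avoids that. Below is a minimal sketch, assuming the stopwords fit comfortably in driver memory. The helper name remove_stopwords is made up for illustration and is not part of the Spark API.

```python
def remove_stopwords(words_rdd, stopwords_rdd):
    """Return words_rdd without any word that appears in stopwords_rdd.

    Assumption: stopwords_rdd is small enough to collect to the driver.
    """
    # An RDD cannot be used inside another RDD's closure, so bring the
    # stopwords back to the driver as an ordinary Python set.
    stopwords = set(stopwords_rdd.collect())

    # Broadcast the set so each executor receives one read-only copy.
    stopwords_bc = words_rdd.context.broadcast(stopwords)

    # The lambda now only touches the broadcast value, not an RDD.
    return words_rdd.filter(lambda word: word not in stopwords_bc.value)
```

In the pyspark shell this would be called as remove_stopwords(RDD1, RDD2).collect(). Unlike subtract, filter does not shuffle the data, so the remaining words keep their original order; subtract is probably the better fit when the second RDD is too large to collect.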
0
votes

If you were using DataFrames, you could use the DataFrame method subtract: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.subtract

subtract(other)

Return a new DataFrame containing rows in this frame but not in another frame.

Edit: It seems that subtract also works for RDDs.