Spark: keys intersection in RDDs with different key types

Question

Consider the following situation: you have two RDDs of key-value pairs whose keys are of different types.

RDD1[(Vector[String], String)] looks like this:

(Vector("A", "B", "E"), "bla bla bla"),
(Vector("W"), "bla bla bla bla"),
(Vector("C", "M"), "bla bla bla bla bla"),
(Vector("A", "V"), "bla bla bla")
...

RDD2[(String, String)] looks like this:

("A", 12),
("B", 434),
("C", 8023),
("D", 3454),
...
("N", 251)

Note that the keys in RDD2 range from A to N inclusive.

The desired output is the pairs of RDD1 for which every string in the Vector key appears in the set of keys of RDD2:

(Vector("A", "B", "E"), "bla bla bla"),
(Vector("C", "M"), "bla bla bla bla bla")

Also, if this is not possible with RDDs, I'd like to know how other abstractions such as DataFrame and Dataset could achieve this result.

gkc2104 · Accepted Answer · 2017-01-13T23:57:27

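import org.apache.spark.rdd.RDD

def myFilter(rdd1: RDD[(Vector[String], String)], rdd2: RDD[(String, String)]): RDD[(Vector[String], String)] = {
  // Collect the keys of rdd2 to the driver; a Set gives constant-time lookups.
  val keys = rdd2.map(_._1).collect().toSet

  // Keep only the entries whose Vector key consists entirely of rdd2 keys.
  rdd1.filter { case (vec, _) => vec.forall(keys.contains) }
}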

This is not the most efficient approach, since it collects every key of rdd2 to the driver, but it gets the job done as long as those keys fit in memory.
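
One inefficiency in the approach above is that the collected keys are serialized along with every task closure. A possible refinement, sketched below, stores the keys in a Set and ships them to the executors once as a broadcast variable. The name myFilterBroadcast is hypothetical and not part of the original answer, and the sketch still assumes that the distinct keys of rdd2 fit in driver memory.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical variant of myFilter that broadcasts the key set.
def myFilterBroadcast(
    rdd1: RDD[(Vector[String], String)],
    rdd2: RDD[(String, String)]): RDD[(Vector[String], String)] = {

  // Collect the distinct keys of rdd2 to the driver as a Set.
  // Assumption: the key set is small enough to fit in driver memory.
  val keys: Set[String] = rdd2.map(_._1).distinct().collect().toSet

  // Send the set to each executor once instead of with every task.
  val keysBc: Broadcast[Set[String]] = rdd1.sparkContext.broadcast(keys)

  // Keep an entry only if every string in its Vector key is a key of rdd2.
  rdd1.filter { case (vec, _) => vec.forall(keysBc.value.contains) }
}
```

If the keys of rdd2 were too numerous to collect, a join-based approach would be needed instead; this sketch does not cover that case.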
