Take these two RDDs:
val rdd1 = sc.parallelize(Array(("test","foo"),("test2", "foo2")))
val rdd2 = sc.parallelize(Array(("test","foo3"),("test3", "foo4")))
When you join them, you have a couple of options. What you do depends on what you want. Do you want an RDD only with the common keys?
val leftJoined = rdd1.join(rdd2)
leftJoined.collect
res1: Array[(String, (String, String))] = Array((test,(foo,foo3)))
If you want keys missing from rdd2
to be filled in with None
, use leftOuterJoin
:
val leftOuter = rdd.leftOuterJoin(rdd2)
leftOuter.collect
res2: Array[(String, (String, Option[String]))] = Array((test2,(foo2,None)), (test,(foo,Some(foo3))))
If you want keys missing from either side to be filled in with None
, use fullOuterJoin
:
val fullOuter = rdd1.fullOuterJoin(rdd2)
fullOuter.collect
res3: Array[(String, (Option[String], Option[String]))] = Array((test2,(Some(foo2),None)), (test3,(None,Some(foo4))), (test,(Some(foo),Some(fo3))))