I have 2 spark RDD, the 1st one contains a mapping between some indices and ids which are strings and the 2nd one contains tuples of related indices
val ids = spark.sparkContext.parallelize(Array[(Int, String)](
(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))).toDF("index", "idx")
val relationships = spark.sparkContext.parallelize(Array[(Int, Int)](
(1, 3), (2, 3), (4, 5))).toDF("index1", "index2")
I want to join somehow these RDD (or merge or sql or any best spark practice) to have at the end related ids instead:
The result of my combined RDD should return:
("a", "c"), ("b", "c"), ("d", "e")
Any idea how I can achieve this operation in an optimal way without loading any of the RDD into a memory map (because in my scenarios, these RDD can potentially load millions of records)