How to perform a join on two files within the same RDD loaded using wholeTextFiles()

Question

I am fairly new to Spark with Scala, so please bear with me if this is a beginner question.

I have a directory test which contains two files, input1.txt and input2.txt. Now, let's say I create an RDD called inputRDD using

val inputRDD = sc.wholeTextFiles("/home/hduser/test")

which loads both files into the pair RDD (inputRDD).

Based on my understanding, inputRDD contains the file name as the key and the file contents as the value, something like this:

(input1.txt,contents of input1.txt)
(input2.txt,contents of input2.txt)

Now, let's say I have to join the two files (which are in the same RDD) on the first column.

contents of input1.txt
----------------------
1 a
1 b
2 c
2 d

contents of input2.txt
----------------------
1 e
2 f
3 g

How can I do that?

Are you expecting more text files in your test directory? If yes, do you want to join them too? If yes, how? — Ramesh Maharjan
@RameshMaharjan for now I have only two and I am not quite sure how to perform a join in this case because they are in the same RDD. It would have been direct if they were in different RDDs. — Rahul Kumar
@RameshMaharjan Thank you :) But, is there a way to do something similar when the file names are not known? because lets say in the real world scenario when there are multiple small sized files, then giving exact file names would get messy! — Rahul Kumar

moon moon · Accepted Answer · 2017-07-22T17:23:41

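You first need to split each file's content into lines and each line into a (key, value) pair, then use reduceByKey to combine the values that share a key. Something like below:

val outputRDD = inputRDD.flatMap { case (_, content) =>
  // each value holds a whole file, so split it into lines first
  content.split("\\r?\\n").map(_.trim).filter(_.nonEmpty).map { line =>
    // split the line into its two columns
    val cols = line.split(" ")
    // emits (1, a) (1, b) (2, c) ...
    (cols(0), cols(1))
  }
}.reduceByKey(_ + _) // for key 1 this gives (1, abe); concatenation order is not guaranteed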
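
If you want an actual join (one output row per matching pair) rather than concatenated values, one option is to derive two RDDs from inputRDD by filtering on the file path and then call join. The following is only a minimal sketch, not a definitive implementation. It assumes the keys produced by wholeTextFiles are full paths ending in the file name, that every line is "key value" separated by whitespace, and that Spark runs in local mode. The names JoinWholeTextFiles, parseLines and recordsOf are made up for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object JoinWholeTextFiles {

  // Turn one file's full text into (key, value) pairs, one per non-empty line.
  // Assumption: each line looks like "<key> <value>"; other lines are skipped.
  def parseLines(content: String): Seq[(String, String)] =
    content.split("\\r?\\n").toSeq
      .map(_.trim)
      .filter(_.nonEmpty)
      .flatMap { line =>
        line.split("\\s+", 2) match {
          case Array(k, v) => Some((k, v))
          case _           => None
        }
      }

  // Keep only the records of the file whose path ends with "/<fileName>".
  // Assumption: the RDD key is the file's full path.
  def recordsOf(input: RDD[(String, String)], fileName: String): RDD[(String, String)] =
    input
      .filter { case (path, _) => path.endsWith("/" + fileName) }
      .flatMap { case (_, content) => parseLines(content) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-whole-text-files").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val inputRDD = sc.wholeTextFiles("/home/hduser/test")
      val left  = recordsOf(inputRDD, "input1.txt")
      val right = recordsOf(inputRDD, "input2.txt")
      // Inner join on the first column: RDD[(key, (leftValue, rightValue))]
      val joined = left.join(right)
      joined.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With the sample files from the question, the inner join should produce (1,(a,e)), (1,(b,e)), (2,(c,f)) and (2,(d,f)) in some order. Key 3 is dropped because it only appears in input2.txt. If you need to keep unmatched keys, rightOuterJoin or fullOuterJoin can be used in place of join. When the file names are not known in advance, the reduceByKey approach above does not depend on them.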
