2 votes

Say that I have two text files called 1.txt and 2.txt. 1.txt contains

1,9,5
2,7,4
3,8,3

and 2.txt contains

1,g,h
2,i,j
3,k,l

So, I joined the two by their keys (first column):

val one = sc.textFile("1.txt").map{
  line => val parts = line.split(",",-1)
    (parts(0),(parts(1),parts(2)))
}

val one = sc.textFile("2.txt").map{
  line => val parts = line.split(",",-1)
    (parts(0),(parts(1),parts(2)))
}

Now, if I understand this correctly, I'm getting

(1,  (  (9,5), (g,h)  ))
(2,  (  (7,4), (i,j)  ))
(3,  (  (8,3), (k,l)  ))
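
(For reference, the join call itself isn't shown in my snippets above; assuming the second RDD is renamed two, I believe it would be something like this:)

val joined = one.join(two)
// each element: (key, ((value columns from 1.txt), (value columns from 2.txt)))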

Now, say that I need to sum up all the values of the second column of 1.txt (i.e. 9, 7, 8):

  1. How do I do this?

  2. How do I refer to the second column of 2.txt (i.e. g, i, k) in the joined RDD?

  3. Is there any good tutorial for working with RDDs? I'm a Spark (and Scala) newbie.


1 Answer

3 votes
  1. Joining is really easy: val joined = one.join(two) (I noticed you named both RDDs "one" for some reason; I'm assuming you meant to give them different names).

  2. Tuple syntax in Scala is tuple._number, so to sum up the second column of 1.txt, if joined is the joined RDD, you do val sum = joined.map(_._2._1._1.toInt).reduce(_ + _). You might want to convert to Long or even BigInt in the map if these files are really big. (See the short tuple example after this list.)

  3. I'd say the best tutorials for Spark are its main site and the AMP Camp material, and I personally like looking through the source and the Scaladocs. For Scala, "Programming in Scala" is a good start.
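
To make the tuple accessors in point 2 concrete, here is a rough sketch of what the nested access does for a single joined value (using the row for key 1 above as the example):

val value = (("9", "5"), ("g", "h"))  // the joined value for key "1"
value._1      // ("9", "5") -- the pair that came from 1.txt
value._1._1   // "9"        -- second column of 1.txt
value._2._1   // "g"        -- second column of 2.txt (your question 2)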

The whole program, slightly rewritten to use better Scala style (disclaimer: not a Scala expert):

val one = sc.textFile("1.txt").map {
  _.split(",", -1) match {
    case Array(a, b, c) => (a, (b, c))
  }
}

// These two map functions are the same; they could be refactored into a
// shared lambda or non-member function.
val two = sc.textFile("2.txt").map {
  _.split(",", -1) match {
    case Array(a, b, c) => (a, (b, c))
  }
}

val joined = one.join(two)

val sum = joined.map {
  // pull the second column of 1.txt (9, 7, 8) out of each joined value
  case (_, ((num2, _), (_, _))) => num2.toInt
}.reduce(_ + _)
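
If you want to sanity-check the pipeline without the files, a rough equivalent over in-memory sample data (assuming a live SparkContext named sc) would be:

// Hypothetical in-memory check using sc.parallelize instead of the text files.
val left = sc.parallelize(Seq("1,9,5", "2,7,4", "3,8,3")).map { line =>
  val Array(a, b, c) = line.split(",", -1)
  (a, (b, c))
}
val right = sc.parallelize(Seq("1,g,h", "2,i,j", "3,k,l")).map { line =>
  val Array(a, b, c) = line.split(",", -1)
  (a, (b, c))
}
val total = left.join(right).map { case (_, ((num, _), _)) => num.toInt }.reduce(_ + _)
// total should come out to 24, i.e. 9 + 7 + 8 from the second column of 1.txt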