I am new to Spark and Scala. I want to sum up all the values present in an RDD of key-value pairs. For example, suppose that after some joins and transformations the RDD contains the following 3 records, where A is the key:

(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))

Now I want to sum the values of each record with the corresponding values in the other records (position by position), so the output should look like:

(A, List(3,3,3,3,3,3,3))
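
Here is how the sample input could be created (assuming a local SparkContext named sc):

val rdd = sc.parallelize(Seq(
  ("A", List(1, 1, 1, 1, 1, 1, 1)),
  ("A", List(1, 1, 1, 1, 1, 1, 1)),
  ("A", List(1, 1, 1, 1, 1, 1, 1))
))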

Can anyone please help me out with this? Is there a way to achieve this in Scala?

Big thanks in advance.

I tried to group them all and then add the elements based on their position, but could not get the required result. – yuvraj rajpurohit

1 Answer


A naive approach is to reduceByKey:

rdd.reduceByKey(
  (xs, ys) => xs.zip(ys).map { case (x, y) => x + y }
)
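
Applied to the sample rdd from the question, this gives the expected result:

rdd.reduceByKey(
  (xs, ys) => xs.zip(ys).map { case (x, y) => x + y }
).collect()
// Array((A,List(3, 3, 3, 3, 3, 3, 3)))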

but it is rather inefficient because it creates a new List on each merge.

You can improve on that by using, for example, aggregateByKey with a mutable buffer:

rdd.aggregateByKey(Array.fill(7)(0))( // mutable Array as the zero value (7 = list length)
  // seqOp: mutate the accumulator in place
  (acc, xs) => {
    for {
      (x, i) <- xs.zipWithIndex
    } acc(i) += x
    acc
  },
  // combOp: for performance you could mutate acc1 in place as well, see below
  (acc1, acc2) => acc1.zip(acc2).map { case (x, y) => x + y }
).mapValues(_.toList)
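
To also avoid allocating a new Array on every merge, combOp can mutate acc1 in place, for example:

(acc1, acc2) => {
  for (i <- acc2.indices) acc1(i) += acc2(i)
  acc1
}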

It should also be possible to use DataFrames, but by default recent versions schedule aggregations separately, so without adjusting the configuration it is probably not worth the effort.
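
If you do want to experiment with it anyway, one hypothetical sketch (assuming Spark 2.1+, a SparkSession named spark, and the sample rdd from the question) is to explode each list with its position, sum per position, and reassemble the list in order:

import org.apache.spark.sql.functions._
import spark.implicits._

rdd.toDF("key", "values")
  .select($"key", posexplode($"values"))        // one row per (key, pos, col)
  .groupBy($"key", $"pos")
  .agg(sum($"col").as("total"))                 // sum of the values at each position
  .groupBy($"key")
  .agg(sort_array(collect_list(struct($"pos", $"total"))).as("pairs"))
  .select($"key", $"pairs.total".as("sums"))    // back to a positional array of sums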