
I am new to Spark/Scala. I have a dataset with many columns, and each column has a name. Given several column names (these names are not fixed; they are generated dynamically), I need to sum up the values in those columns. Is there an efficient way to do this?

I worked out a way using a for loop, but I don't think it is efficient:

val AllLabels = List("ID", "val1", "val2", "val3", "val4")
val lbla = List("val1", "val3", "val4")
val index_lbla = lbla.map(x => AllLabels.indexOf(x))

val dataRDD = sc.textFile("../test.csv").map(_.split(","))

dataRDD.map { x =>
  var sum = 0.0
  for (i <- 1 to index_lbla.length)
    sum = sum + x(i).toDouble
  sum
}.collect

The test.csv looks like the following (the file itself contains no header row; the column names are shown only for reference):

"ID", "val1", "val2", "val3", "val4"
 A, 123, 523, 534, 893
 B, 536, 98, 1623, 98472
 C, 537, 89, 83640, 9265
 D, 7297, 98364, 9, 735
 ...

Your help is very much appreciated!


1 Answer


The for loop you mention is just syntactic sugar for higher-order functions such as map in Scala; you may want to read up on how the compiler desugars for comprehensions.
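
For example, the two definitions below are equivalent; the compiler rewrites the for comprehension into the map call (a minimal illustration on a plain Scala list, nothing Spark-specific):

val xs = List(1.0, 2.0, 3.0)

// A for comprehension that yields a value...
val doubled = for (x <- xs) yield x * 2.0

// ...is rewritten by the compiler into an equivalent map call:
val doubledDesugared = xs.map(x => x * 2.0)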

In this case, you can replace the for loop with a map followed by sum:

dataRDD.map(x => index_lbla.map(i => x(i).toDouble).sum).collect

Note that this also fixes a bug in the original version: the loop there always sums x(1) through x(index_lbla.length) (i.e. the first few columns), rather than the columns whose indices are stored in index_lbla.
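
For completeness, here is a self-contained sketch of the whole pipeline, assuming a SparkContext named sc (as in the spark-shell) and a test.csv laid out like the sample above with no header row; the trim call is an extra assumption to cope with the spaces shown around the sample values:

val AllLabels = List("ID", "val1", "val2", "val3", "val4")
val lbla = List("val1", "val3", "val4")
val index_lbla = lbla.map(x => AllLabels.indexOf(x))  // List(1, 3, 4)

// Split each line on commas and strip stray whitespace around the values.
val dataRDD = sc.textFile("../test.csv").map(_.split(",").map(_.trim))

// For each row, select only the requested columns and sum them.
val sums = dataRDD.map(x => index_lbla.map(i => x(i).toDouble).sum)

sums.collect()
// For the sample rows: A -> 123 + 534 + 893    = 1550.0
//                      B -> 536 + 1623 + 98472 = 100631.0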