I am new to Spark/Scala. I have a dataset with many columns, each with a column name. Given several column names (these names are not fixed; they are generated dynamically), I need to sum the values of those columns for each row. Is there an efficient way to do this?
I worked out a way using a for loop, but I don't think it is efficient:
val AllLabels = List("ID", "val1", "val2", "val3", "val4")
val lbla = List("val1", "val3", "val4")
// map each requested column name to its position in the row
val index_lbla = lbla.map(x => AllLabels.indexOf(x))
val dataRDD = sc.textFile("../test.csv").map(_.split(","))
dataRDD.map { x =>
  var sum = 0.0
  for (i <- index_lbla)   // iterate over the selected column indices
    sum = sum + x(i).toDouble
  sum
}.collect
The test.csv looks like below (the column names are shown here for reference only; the file itself has no header row):
"ID", "val1", "val2", "val3", "val4"
A, 123, 523, 534, 893
B, 536, 98, 1623, 98472
C, 537, 89, 83640, 9265
D, 7297, 98364, 9, 735
...
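For reference, here is a minimal plain-Scala sketch of the same per-row sum (no Spark required), using the first two sample rows above as hard-coded input; the row literals and variable names are just for illustration:

    val AllLabels = List("ID", "val1", "val2", "val3", "val4")
    val lbla = List("val1", "val3", "val4")
    // indices of the requested columns: List(1, 3, 4)
    val indices = lbla.map(AllLabels.indexOf)

    // two rows copied from test.csv above
    val rows = Seq("A,123,523,534,893", "B,536,98,1623,98472")

    // split each row and sum only the selected columns
    val sums = rows
      .map(_.split(","))
      .map(cols => indices.map(i => cols(i).trim.toDouble).sum)

    println(sums)   // List(1550.0, 100631.0)

The same `cols => indices.map(i => cols(i).trim.toDouble).sum` function can be passed to `dataRDD.map` in place of the mutable `var sum` loop.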
Your help is very much appreciated!