0
votes

I am new to Scala programming , i have worked on R very extensively but while working for scala it has become tough to work in a loop to extract specific columns to perform computation on the column values

let me explain with help of an example :

i have Final dataframe arrived after joining the 2 dataframes, now i need to perform calculation like enter image description here

Above is the computation with reference to the columns , so after computation we'll get the below spark dataframe

enter image description here

How to refer to the column index in for-loop to compute the new column values in spark dataframe in scala

1

1 Answers

1
votes

Here is one solution:

Input Data:
+---+---+---+---+---+---+---+---+---+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |
+---+---+---+---+---+---+---+---+---+
|24 |74 |74 |21 |66 |65 |100|27 |19 |
+---+---+---+---+---+---+---+---+---+

Zipped the columns to remove the non-matching columns:

val oneCols = data.schema.filter(_.name.contains("1")).map(x => x.name).sorted
val twoCols = data.schema.filter(_.name.contains("2")).map(x => x.name).sorted
val cols = oneCols.zip(twoCols) 

//cols: Seq[(String, String)] = List((a1,a2), (b1,b2), (c1,c2), (d1,d2))

Use foldLeft function to dynamically add columns:

import org.apache.spark.sql.functions._
val result = cols.foldLeft(data)((data,c) => data.withColumn(s"Diff_${c._1}",
                                                  (col(s"${lit(c._2)}") - col(s"${lit(c._1)}"))/col(s"${lit(c._2)}")))

Here is the result:

result.show(false)  

+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |Diff_a1           |Diff_b1|Diff_c1            |Diff_d1             |
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|24 |74 |74 |21 |66 |65 |100|27 |19 |0.6307692307692307|0.26   |-1.7407407407407407|-0.10526315789473684|
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+