
I have a CSV file of the following format:

key, age, marks, feature_n
abc, 23, 84, 85.3
xyz, 25, 67, 70.2

Here the number of features can vary; in this example there are 3 features (age, marks and feature_n). I have to convert it into a Map[String,String] as below:

[key,value]
["abc","age:23,marks:84,feature_n:85.3"]
["xyz","age:25,marks:67,feature_n:70.2"]

I have to join the above data with another dataset A on the column 'key' and append the 'value' to another column in dataset A. The CSV file can be loaded into a dataframe, with the schema defined by the first (header) row of the file:

val newRecords = sparkSession.read.option("header", "true").option("mode", "DROPMALFORMED").csv("/records.csv");

After this I will join the dataframe newRecords with dataset A and append the 'value' to one of the columns of dataset A.
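For context, the join I plan to do afterwards would look roughly like this (datasetA and newRecordsWithValue are placeholder names; the latter stands for newRecords reduced to the key/value pair described above):

val result = datasetA.join(newRecordsWithValue, Seq("key"), "left")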

How can I iterate over each column of each row, excluding the column "key", and generate a string of the format "age:23,marks:84,feature_n:85.3" from newRecords?

I can alter the format of csv file and have the data in JSON format if it helps.
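For example, if the same records were stored as JSON Lines (one object per line; the file name is hypothetical), they could be read with:

{"key": "abc", "age": 23, "marks": 84, "feature_n": 85.3}

val newRecords = sparkSession.read.json("/records.json")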

I am fairly new to Scala and Spark.

This looks like a standard map/collect operation to me. Can you please clarify where you're having issues? - Joe C
The number of features is variable; I tried to signify this by naming the last feature feature_n. So I need to iterate over a variable number of columns to generate the final string. Sorry, that was not explicit from the question. - user2804130

1 Answer


I would suggest the following solution:

import org.apache.spark.rdd.RDD

// Build "name:value" pairs from every column except "key", so this
// handles a variable number of feature columns.
val featureCols = newRecords.columns.filter(_ != "key")
val updated: RDD[String] = newRecords.rdd.map(row =>
  featureCols.map(c => s"$c:${row.getAs[String](c)}").mkString(","))
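Note that dropping "key", as in the original attempt, would lose the join key. If you want to keep it for the subsequent join with dataset A, a DataFrame-only sketch (column names as in the question) could build the value column directly:

import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

val withValue = newRecords.select(
  col("key"),
  concat_ws(",", featureCols.map(c => concat(lit(c + ":"), col(c))): _*).as("value"))
// withValue can then be joined with dataset A on "key".

Since concat returns null if any input is null and concat_ws skips null arguments, rows with missing features simply omit those pairs.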