I have a DataFrame with 3 columns: Number (Integer), Name (String), Color (String). Below is the output of df.show after repartitioning.
val df = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").option("encoding", "UTF-8").load(fileName).repartition(5).toDF()
+------+------+------+
|Number|  Name| Color|
+------+------+------+
|     4|Orange|Orange|
|     3| Apple| Green|
|     1| Apple|   Red|
|     2|Banana|Yellow|
|     5| Apple|   Red|
+------+------+------+
My objective is to create a list of strings, one per row, by replacing the tokens in a common dynamic string (which I pass as a parameter to the method) with the corresponding column values. For example: commonDynamicString = Column.Name with Column.Color color
In this string, my tokens are Column.Name and Column.Color. I need to replace these tokens, for every row, with the respective values in those columns. Note: this string can change dynamically, so hardcoding won't work.
I would prefer not to use RDDs unless there is no way to do this with the DataFrame API.
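To make the token idea concrete, here is a minimal plain-Scala sketch (no Spark involved) of finding the Column.<name> tokens with a regex and substituting them from a map of values. The object name TokenDemo and the helpers tokensIn and fill are hypothetical names used only for illustration, and the pattern assumes column names consist of word characters.

```scala
import scala.util.matching.Regex

object TokenDemo {
  // Tokens look like "Column.<name>"; the capture group is the column name.
  val TokenPattern: Regex = """Column\.(\w+)""".r

  // Hypothetical helper: list the column names referenced by the template.
  def tokensIn(template: String): List[String] =
    TokenPattern.findAllMatchIn(template).map(_.group(1)).toList

  // Hypothetical helper: replace each token with the value for that column.
  // Tokens naming a column that is not in the map are left untouched.
  def fill(template: String, values: Map[String, Any]): String =
    TokenPattern.replaceAllIn(template, m =>
      Regex.quoteReplacement(
        values.get(m.group(1)).map(_.toString).getOrElse(m.matched)))
}
```

For instance, TokenDemo.fill("Column.Name with Column.Color color", Map("Name" -> "Apple", "Color" -> "Red")) should give "Apple with Red color". Because the tokens are discovered from the string itself, nothing is hardcoded to a particular set of columns.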
Below are the approaches I tried but couldn't achieve my objective.
Option 1:
val a = df.foreach(t => {
  // Replace each token with the value of the matching column in this row
  val finalValue = commonDynamicString.replace("Column.Number", t.getAs[Any]("Number").toString)
    .replace("Column.Name", t.getAs[String]("Name"))
    .replace("Column.Color", t.getAs[String]("Color"))
  println("finalValue: " + finalValue)
})
With this approach, finalValue prints as expected. However, I cannot populate a ListBuffer or pass the final strings on as a list to another function, because foreach returns Unit and Spark throws an error.
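One possible way around the Unit return value, sketched under assumptions rather than offered as a definitive answer: foreach runs on the executors purely for side effects, whereas Dataset.map returns a new Dataset[String] that can be collected on the driver. The object RowStrings and the method render are hypothetical names, and the sketch assumes every token has the form Column.<columnName>.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, Row}

object RowStrings {
  // Hypothetical helper: build one string per row by replacing every
  // "Column.<name>" token with that row's value for the column.
  def render(df: DataFrame, template: String): List[String] = {
    val names = df.columns // captured by the closure; a plain Array[String]
    df.map { (row: Row) =>
      names.foldLeft(template) { (acc, name) =>
        // String.valueOf renders a null cell as the text "null"
        acc.replace("Column." + name, String.valueOf(row.getAs[Any](name)))
      }
    }(Encoders.STRING) // explicit encoder, so spark.implicits._ is not required
      .collect()       // pulls every row to the driver; only for small results
      .toList
  }
}
```

This stays within the Dataset API rather than dropping to RDDs. Two caveats: the plain String.replace would misfire if one column name is a prefix of another (for example Name and NameX), and for large datasets it is probably better to keep the result as a Dataset[String] instead of calling collect.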
Option 2: I am considering this option but need some guidance. Can foldLeft, a window function, or some other Spark function be used to create a fourth column called "Final" via withColumn, together with a UDF that extracts all the tokens with a regex such as "Column\.\w+" and replaces each one with the corresponding column value?
+------+------+------+--------------------------+
|Number|  Name| Color|Final                     |
+------+------+------+--------------------------+
|     4|Orange|Orange|Orange with Orange color  |
|     3| Apple| Green|Apple with Green color    |
|     1| Apple|   Red|Apple with Red color      |
|     2|Banana|Yellow|Banana with Yellow color  |
|     5| Apple|   Red|Apple with Red color      |
+------+------+------+--------------------------+
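A "Final" column like the one above can probably be produced without a UDF. The following is a sketch under assumptions, not a verified solution for this exact setup: it parses the template once on the driver and turns it into a single Column expression built from the standard concat, lit and col functions. The object TemplateColumn and the method templateToColumn are hypothetical names.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, concat, lit}
import scala.collection.mutable.ListBuffer

object TemplateColumn {
  private val TokenPattern = """Column\.(\w+)""".r

  // Hypothetical helper: literal text becomes lit(...), and each
  // Column.<name> token becomes col(<name>) cast to string.
  def templateToColumn(template: String, df: DataFrame): Column = {
    val known = df.columns.toSet
    val parts = ListBuffer.empty[Column]
    var last = 0
    for (m <- TokenPattern.findAllMatchIn(template)) {
      // Literal text between the previous token and this one
      if (m.start > last) parts += lit(template.substring(last, m.start))
      // Tokens that name no existing column are kept as literal text
      parts += (if (known(m.group(1))) col(m.group(1)).cast("string")
                else lit(m.matched))
      last = m.end
    }
    // Trailing literal text after the final token
    if (last < template.length) parts += lit(template.substring(last))
    // Note: concat yields null for a row if any referenced column is null
    concat(parts.toSeq: _*)
  }
}
```

Usage would be along the lines of df.withColumn("Final", TemplateColumn.templateToColumn(commonDynamicString, df)). Since the result is an ordinary column expression, Spark can optimize it like any other built-in function, which is usually preferable to a UDF on large datasets. A window function should not be needed here, because each output value depends only on its own row.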
Can someone help me with this problem, and also let me know whether I am thinking in the right direction about using Spark to handle large datasets?
Thanks!