I am trying to map the values of one column in my dataframe to a new value and put it into a new column using a UDF, but I am unable to get the UDF to accept a parameter that isn't also a column. For example I have a dataframe dfOriginial like this:
+-----------+-----+
|high_scores|count|
+-----------+-----+
| 9| 1|
| 21| 2|
| 23| 3|
| 7| 6|
+-----------+-----+
And I'm trying to get a sense of the bin the numeric value falls into, so I may construct a list of bins like this:
case class Bin(binMax:BigDecimal, binWidth:BigDecimal) {
val binMin = binMax - binWidth
// only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax
def rangeAsString(): String = {
val sb = new StringBuilder()
sb.append(trimDecimal(binMin)).append(" - ").append(trimDecimal(binMax))
sb.toString()
}
}
And then I want to transform my old dataframe like this to make dfBin:
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
| 9| 1| 0 - 10 |
| 21| 2| 20 - 30 |
| 23| 3| 20 - 30 |
| 7| 6| 0 - 10 |
+-----------+-----+---------+
So that I can ultimately get a count of the instances of the bins by calling .groupBy("bin_range").count().
I am trying to generate dfBin by using the withColumn function with an UDF.
Here's the code with the UDF I am attempting to use:
val convertValueToBinRangeUDF = udf((value:String, binList:List[Bin]) => {
val number = BigDecimal(value)
val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
bin.rangeAsString()
})
val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
val dfBin = dfOriginal.withColumn("bin_range", convertValueToBinRangeUDF(col("high_scores"), binList))
But it's giving me a type mismatch:
Error:type mismatch;
found : List[Bin]
required: org.apache.spark.sql.Column
val valueCountsWithBin = valuesCounts.withColumn(binRangeCol, convertValueToBinRangeUDF(col(columnName), binList))
Seeing the definition of an UDF makes me think it should handle the conversion fine, but it's clearly not, any ideas?