I have a Spark DataFrame that looks like:
| id | value | bin |
|----+-------+-----|
| 1 | 3.4 | 2 |
| 2 | 2.6 | 1 |
| 3 | 1.8 | 1 |
| 4 | 9.6 | 2 |
I have a function `f` that takes an array of values and returns a number. I want to add a column to the above DataFrame where the value of the new column in each row is `f` applied to all the `value` entries that have the same `bin` entry, i.e.:
| id | value | bin | f_value |
|----+-------+-----+---------------|
| 1 | 3.4 | 2 | f([3.4, 9.6]) |
| 2 | 2.6 | 1 | f([2.6, 1.8]) |
| 3 | 1.8 | 1 | f([2.6, 1.8]) |
| 4 | 9.6 | 2 | f([3.4, 9.6]) |
Since I need to aggregate all `value`s per `bin`, I cannot simply use the `withColumn` function to add this new column. What is the best way to do this until user-defined aggregation functions make their way into Spark?