2 votes

I have a Spark DataFrame that looks like:

| id | value | bin |
|----|-------|-----|
|  1 |   3.4 |   2 |
|  2 |   2.6 |   1 |
|  3 |   1.8 |   1 |
|  4 |   9.6 |   2 |

I have a function f that takes an array of values and returns a number. I want to add a column to the above DataFrame where the value of the new column in each row is f applied to all the value entries that have the same bin entry, i.e.:

| id | value | bin | f_value       |
|----|-------|-----|---------------|
|  1 |   3.4 |   2 | f([3.4, 9.6]) |
|  2 |   2.6 |   1 | f([2.6, 1.8]) |
|  3 |   1.8 |   1 | f([2.6, 1.8]) |
|  4 |   9.6 |   2 | f([3.4, 9.6]) |

Since I need to aggregate all values per bin, I cannot use the withColumn function to add this new column. What is the best way to do this until user-defined aggregation functions make their way into Spark?


1 Answer

2 votes

The code below is not tested, but it should give the idea.

In Hive (i.e. with a HiveContext), this can be done using the collect_list function:

aboveDF.registerTempTable("aboveDF")  // register so it can be queried via SQL
val newDF = sqlContext.sql(
    "select bin, collect_list(value) as values from aboveDF group by bin")

Next, join aboveDF and newDF on bin.
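
For example, a rough, untested sketch of that join plus a UDF to apply f (here a sum stands in for your actual f, and the two-argument join overload assumes Spark 1.4+):

import org.apache.spark.sql.functions.{col, udf}

// Stand-in for f; replace the body with your real function
val fUdf = udf((values: Seq[Double]) => values.sum)

// Attach each bin's collected values to its rows, apply f, then drop the helper column
val result = aboveDF
  .join(newDF, "bin")
  .withColumn("f_value", fUdf(col("values")))
  .drop("values")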

Is this what you are looking for?