reducebykey and aggregatebykey in spark Dataframe

Question

I am using spark 2.0 to read the data from parquet file .

 val Df = sqlContext.read.parquet("c:/data/parquet1")
 val dfSelect= Df.
      select(
        "id",
        "Currency",
        "balance"

      )



val dfSumForeachId=dfSelect.groupBy("id").sum("balance")
val total=dfSumForeachId.agg(sum("sum(balance)")).first().getDouble(0)

In order to get a total balance value is this the best way of getting it using an action first() on a dataframe ?

In spark 2.0 is it fine to use groupby key ,does it have the same performance issue like groupbykey on rdd like does it need to shuffle the whole data over the network and then perform aggregation or the aggregation is performed locally like reducebykey in earlier version of the spark

Thanks

Assaf Mendelson Assaf Mendelson · Accepted Answer · 2017-01-03T11:07:16

Getting the data by using first is a perfectly valid way of getting the data. That said, doing:

val total = dfSelect.agg(sum("balance")).first().getDouble(0)

would probably give you better performance for getting the total.

group by key and reduce by key work exactly the same as previous versions for the same reasons. group by key makes no assumption on the action you want to do and therefore cannot know how to do partial aggregations as reduce by key does.

When you do dataframe groupby and sum you are actually doing reduce by key with the + option and the second aggregation you did is a reduce with the +. That said dataframe does it more efficiently because, knowing exactly what is done it can perform many optimizations such as whole stage code generation.

reducebykey and aggregatebykey in spark Dataframe

1 Answers