
I am trying to create a new column ("newaggCol") in a Spark DataFrame using groupBy and sum (with PySpark 1.5). My numeric columns have been cast to either Long or Double. The columns used to form the groupBy are String and Timestamp. My code is as follows:

df= df.withColumn("newaggCol",(df.groupBy([df.strCol,df.tsCol]).sum(df.longCol)))

The traceback for the error points to that line, stating:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Am I calling the functions incorrectly?
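
For reference, a toy DataFrame matching that schema can be built roughly like the sketch below (the column names come from my snippet; the values and the existing SparkContext `sc` are made up):

from datetime import datetime
from pyspark.sql import SQLContext

# Toy data matching the schema described above (values are made up):
# a String column, a Timestamp column, and a Long column.
sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`
df = sqlContext.createDataFrame(
    [("a", datetime(2015, 1, 1, 0, 0), 1),
     ("a", datetime(2015, 1, 1, 1, 0), 2),
     ("b", datetime(2015, 1, 2, 0, 0), 3)],
    ["strCol", "tsCol", "longCol"])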


1 Answer


It is not possible using SQL aggregations: df.groupBy(...).sum(...) returns a new, aggregated DataFrame rather than a Column, so it cannot be passed to withColumn. You can easily get the desired result using window functions instead:

import sys
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as sum_

# Window covering every row within each (strCol, tsCol) partition.
# Spark 1.5 has no Window.unboundedPreceding/unboundedFollowing,
# so the unbounded frame is expressed with sys.maxsize instead.
w = (Window
    .partitionBy(df.strCol, df.tsCol)
    .rowsBetween(-sys.maxsize, sys.maxsize))

df.withColumn("newaggCol", sum_(df.longCol).over(w))
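
Note that in Spark versions before 2.0 window functions require a HiveContext. If that is not an option, a rough alternative sketch is to compute the per-group sum with a regular aggregation and join it back onto the original rows (the renamed helper columns below are just to avoid name clashes):

from pyspark.sql.functions import sum as sum_

# Per-group sums, with the grouping columns renamed to avoid ambiguity in the join.
agg = (df
    .groupBy("strCol", "tsCol")
    .agg(sum_("longCol").alias("newaggCol"))
    .withColumnRenamed("strCol", "strCol_g")
    .withColumnRenamed("tsCol", "tsCol_g"))

# Join the per-group sum back onto every original row, then drop the helper columns.
result = (df
    .join(agg, (df.strCol == agg.strCol_g) & (df.tsCol == agg.tsCol_g))
    .drop("strCol_g")
    .drop("tsCol_g"))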