0
votes

My DataFrame looks like this:

id       product          amount
1         a                 100
2         b                 300
3         c                 400
4         d                 100

My final DataFrame should look like this, with the sum of amount repeated on every row:

id       product          amount       total
1         a                 100         900
2         b                 300         900
3         c                 400         900
4         d                 100         900

How can I do this in PySpark?

2 Answers

0
votes

I think this should work, though it is not the best approach:

from pyspark.sql import functions as F
df = df.crossJoin(df.select(F.sum("amount").alias("total")))

Another way I can think of is to compute the sum first and attach it as a literal column:

df = df.withColumn("total", F.lit(df.select(F.sum("amount")).collect()[0][0]))

IMO, both solutions are fairly expensive.

0
votes

You can do this with a window operation, as below:

 from pyspark.sql import functions as F   # avoids shadowing Python's built-in sum
 from pyspark.sql.window import Window
 spec = Window.partitionBy()              # no partition columns: one window over all rows
 newdf = df.withColumn('total', F.sum('amount').over(spec))

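Note: because the window has no partition columns, Spark logs the warning "No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation." All rows are moved to one partition to compute the sum, so this can be slow on large DataFrames.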