
I have a dataframe of this format:

Date       | Return
01/01/2015 | 0.0
02/02/2015 | 0.04
03/02/2015 | 0.06
04/02/2015 | 0.16

I need to calculate the cumulative standard deviation for each row and also report the number of rows above it. So my result will look somewhat like this:

Date       | Rows above | Compounded
01/01/2015 | 0          | 0             (first element to be kept zero)
02/02/2015 | 1          | 0.02828427125 (std dev of 0, 0.04)
03/02/2015 | 2          | 0.03055050463 (std dev of 0, 0.04, 0.06)
04/02/2015 | 3          | 0.06806859286 (std dev of 0, 0.04, 0.06, 0.16)

I am new to SparkSQL and especially new to window functions, so answers in Java would be highly appreciated. Thanks.

Do you have any grouping column? - user6022341
Yeah, there are other columns as well, but I don't think they will be required. You can group by date as well. - Anwesh Sinha Ray

1 Answer


You can actually do this using standard operations and window functions. Basically, the standard deviation can be calculated from sum of X^2, sum of X, and n (the number of elements). So compute all three as aggregations over a window, and then calculate the standard deviation from them.
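For reference, the step from those three aggregates to the result is the usual shortcut identity for the (population) variance, with the standard deviation as its square root; in terms of the column names used below:

 Var(X) = E[X^2] - (E[X])^2 = sumx2/cnt - (sumx/cnt)^2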

So, something like this (freeform PySpark):

 from pyspark.sql import Window
 from pyspark.sql import functions as F

 # cumulative window: every row from the start of the frame up to the current row
 window = Window.orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
 aggDF = (df.withColumn("sumx", F.sum("Return").over(window))
            .withColumn("sumx2", F.sum(F.col("Return") * F.col("Return")).over(window))
            .withColumn("cnt", F.count("Return").over(window)))
 # E[X^2] - (E[X])^2 gives the variance of the rows seen so far
 newDF = aggDF.withColumn("Var", F.col("sumx2") / F.col("cnt") - (F.col("sumx") / F.col("cnt")) ** 2)

Of course, you have to handle the case of zero elements, and if you want the standard deviation instead of the variance you should take the square root.
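Not part of the original answer, but worth noting: Spark 1.6+ also ships built-in stddev_samp / stddev_pop aggregates, so the whole result, including the "Rows above" column, can be written directly over the same cumulative window. A minimal PySpark sketch, assuming the input DataFrame df from the question (the Java DataFrame API exposes the same functions via org.apache.spark.sql.functions and org.apache.spark.sql.expressions.Window):

 from pyspark.sql import Window
 from pyspark.sql import functions as F

 order = Window.orderBy("Date")   # ordering only, used for row_number
 cumul = order.rowsBetween(Window.unboundedPreceding, Window.currentRow)

 result = (df
     .withColumn("Rows above", F.row_number().over(order) - 1)
     # stddev_samp returns null for a single row, so fall back to 0 as in the expected output
     .withColumn("Compounded", F.coalesce(F.stddev_samp("Return").over(cumul), F.lit(0.0))))

The sample standard deviation (n - 1 denominator) is what matches the numbers in the expected output; use stddev_pop if you want the population version that the sum-of-squares formula above produces.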