My raw data comes in a tabular format. It contains observations from different variables, and each observation consists of the variable name, a timestamp and the value at that time:
Variable [string], Time [datetime], Value [float]
The data is stored as Parquet in HDFS and loaded into a Spark DataFrame (df).
Now I want to calculate default statistics like mean, standard deviation and others for each variable. Afterwards, once the mean has been retrieved, I want to filter/count those values for that variable that lie closely around the mean.
Therefore I need the mean for each variable first, which is why I'm using groupBy to get the statistics per variable (not for the whole dataset):
df_stats = df.groupBy(df.Variable).agg(
    count(df.Variable).alias("count"),
    mean(df.Value).alias("mean"),
    stddev(df.Value).alias("std_deviation"))
With the mean for each variable I can then count the values for that specific variable that lie around the mean. For that I need all observations (values) for the variable, and those are in the original dataframe df, not in the aggregated/grouped dataframe df_stats.
Finally I want one dataframe like the aggregated/grouped df_stats with a new column "count_around_mean".
I was thinking of using df_stats.map(...) or df_stats.join(df, df.Variable), but I'm stuck on the red arrows :(
Question: How would you realize that?
Temporary solution: Meanwhile I'm using a solution based on your idea, but the range functions for stddev radius 2 and 3 do not work. They always yield an
AttributeError saying NullType has no _jvm
from pyspark.sql.window import Window
from pyspark.sql.functions import *
from pyspark.sql.types import *
w1 = Window.partitionBy("Variable")
w2 = Window.partitionBy("Variable").orderBy("Time")
def stddev_pop_w(col, w):
    # Built-in stddev doesn't support windowing in this Spark version,
    # so compute the population stddev manually: sqrt(E[x^2] - E[x]^2).
    return sqrt(avg(col * col).over(w) - pow(avg(col).over(w), 2))
import builtins

def isInRange(value, mean, stddev, radius):
    # NOTE: `from pyspark.sql.functions import *` shadows Python's built-in
    # abs() with the Column function of the same name. Calling that Column
    # version inside a UDF (where there is no JVM gateway) is what raises
    # the "... has no attribute '_jvm'" AttributeError, so use the real
    # built-in explicitly.
    try:
        if builtins.abs(value - mean) < radius * stddev:
            return 1
        else:
            return 0
    except TypeError:
        # value, mean or stddev is None
        return -1
delta = col("Time").cast("long") - lag("Time", 1).over(w2).cast("long")
#f = udf(lambda (value, mean, stddev, radius): abs(value - mean) < radius * stddev, IntegerType())
f2 = udf(lambda value, mean, stddev: isInRange(value, mean, stddev, 2), IntegerType())
f3 = udf(lambda value, mean, stddev: isInRange(value, mean, stddev, 3), IntegerType())
df \
    .withColumn("mean", mean("Value").over(w1)) \
    .withColumn("std_deviation", stddev_pop_w(col("Value"), w1)) \
    .withColumn("delta", delta) \
    .withColumn("stddev_2", f2("Value", "mean", "std_deviation")) \
    .withColumn("stddev_3", f3("Value", "mean", "std_deviation")) \
    .show(5, False)