If we have a Pandas DataFrame with a column of categories and a column of values, we can remove the mean within each category as follows:
df["DemeanedValues"] = df.groupby("Category")["Values"].transform(lambda g: g - numpy.mean(g))
As far as I understand, Spark DataFrames do not directly offer this group-by/transform operation (I am using PySpark on Spark 1.5.0). So, what is the best way to implement this computation?
I have tried using a group-by/join as follows:
df2 = df.groupBy("Category").mean("Values")
df3 = df2.join(df, "Category")
But it is very slow since, as I understand it, each category requires a full scan of the DataFrame.
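Spelled out, the full join-based computation I have in mind looks roughly like this (I am aliasing the aggregate to MeanValues myself; with mean("Values") the generated column name would be something like avg(Values)):
from pyspark.sql import functions as F

# Compute the per-category mean, join it back onto the original rows, and subtract it.
means = df.groupBy("Category").agg(F.avg("Values").alias("MeanValues"))
df3 = (df.join(means, "Category")
         .withColumn("DemeanedValues", F.col("Values") - F.col("MeanValues"))
         .drop("MeanValues"))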
I think (but have not verified) that I can speed this up a great deal if I collect the result of the group-by/mean into a dictionary, and then use that dictionary in a UDF as follows:
nameToMean = {...}
f = lambda category, value: value - nameToMean[category]
categoryDemeaned = pyspark.sql.functions.udf(f, pyspark.sql.types.DoubleType())
df = df.withColumn("DemeanedValues", categoryDemeaned(df.Category, df.Values))
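Concretely, I imagine building nameToMean by collecting the group-by result, something like this untested sketch (again aliasing the mean column to MeanValues so the field name is easy to access):
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Collect the per-category means to the driver and build a plain dict.
rows = df.groupBy("Category").agg(F.avg("Values").alias("MeanValues")).collect()
nameToMean = {row.Category: row.MeanValues for row in rows}

# Subtract the appropriate mean via a UDF that closes over the dict.
demean = udf(lambda category, value: value - nameToMean[category], DoubleType())
df = df.withColumn("DemeanedValues", demean(df.Category, df.Values))
If the number of categories were large, I suppose I could broadcast nameToMean with sc.broadcast before using it in the UDF, but I have not measured whether that matters.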
Is there an idiomatic way to express this type of operation without sacrificing performance?