Goal: group a Dask dataframe by multiple columns and filter out groups that contain fewer than 3 rows.
Based on this post: Filtering grouped df in Dask
I'm able to calculate the size of each group, but I cannot figure out how to map it back to my dataframe after a multi-column groupby. I tried many variations of the following, to no avail:
a = input_df.groupby(["FeatureID", "region"])["Target"].size()
s = input_df[["FeatureID", "region"]].map(a)
The map approach works great for a single-column groupby.
Solution
Thanks to @jezrael, I was able to come up with the following solution. Note that it uses nunique(), so it counts distinct Target values per group rather than rows:
a = input_df.groupby(["FeatureID", "region"])["Target"].nunique().to_frame("feature_div")
input_df = input_df.join(a, on=["FeatureID", "region"])
# filter out features below diversity threshold
diversified = input_df[input_df.feature_div >= diversity_threshold]