I am building a pivot table in pandas and need to apply a variety of functions to the data. My problem is, for certain aggfuncs, I need to apply them to the entire dataset, whereas for others, I need to drop duplicates before applying the function. My dataframe looks similar to this:
I'm working with a dataframe similar to this:
Name | Metric 1 | Metric 2 | Country | Payment |
---|---|---|---|---|
John | 0.10 | 5.00 | Canada | 100 |
John | 0.30 | 1.00 | Canada | 100 |
John | .40 | Canada | 100 | |
Jane | 0.50 | US | 30 | |
Jane | US | 30 | ||
Jack | UK | 50 | ||
Jack | .70 | .2 | UK | 50 |
Jack | 1.00 | UK | 50 | |
Jack | UK | 50 | ||
Polly | 0.30 | Canada | 150 | |
Mike | Argentina | 80 | ||
Mike | Argentina | 80 | ||
Steve | Canada | 20 | ||
Lily | 0.15 | 1.20 | Mexico | 40 |
Kate | 3.00 | Australia | 90 | |
Edward | 0.05 | Australia | 70 | |
Pete | 0.02 | 0.03 | New Zealand | 20 |
Here's my code:
import pandas as pd
import io
df = pd.read_csv(io.BytesIO(data_to_load['metric data.csv']))
def metricgreaterthanone(x):
n = pd.to_numeric(x, errors="coerce")
return (n > 1).sum() / n.notna().sum()
pd.pivot_table(df,index=['Country'],values=["Name","Metric 1","Metric 2","Payment"],aggfunc={'Name':pd.Series.nunique, "Metric 1":metricgreaterthanone,"Metric 2":metricgreaterthanone,"Payment":np.sum},margins=True)
Metric 1 and Metric 2 are unique for each row and must be used in the function without dropping any duplicates. The payment, however, is repeated (that is, John did not receive $300, his total is only $100). If I run the code as is, I get correct output for my pivot table on Metric 1 and Metric 2 but the sum of payments for Canada is $470 whereas it should be $270.
If I apply .drop_duplicates to the pivot table, like so:
pd.pivot_table(df.drop_duplicates(subset=['Name']),index=['Country'],values=["Name","Metric 1","Metric 2","Payment"],aggfunc={'Name':pd.Series.nunique, "Metric 1":metricgreaterthanone,"Metric 2":metricgreaterthanone,"Payment":np.sum},margins=True)
Then the sum of payments is correct for each country but the Metric functions are no longer correct. So, my question is: is there a way to drop the duplicates only as they relate to the Payment column?
Thank you for any and all help! (Bonus points if you can also explain why the margins are not working here either!)