Can you drop duplicates for only one aggfunc in a pandas pivot table?

Question

I am building a pivot table in pandas and need to apply a variety of functions to the data. My problem is, for certain aggfuncs, I need to apply them to the entire dataset, whereas for others, I need to drop duplicates before applying the function. My dataframe looks similar to this:

I'm working with a dataframe similar to this:

Name	Metric 1	Metric 2	Country	Payment
John	0.10	5.00	Canada	100
John	0.30	1.00	Canada	100
John		.40	Canada	100
Jane	0.50		US	30
Jane			US	30
Jack			UK	50
Jack	.70	.2	UK	50
Jack	1.00		UK	50
Jack			UK	50
Polly	0.30		Canada	150
Mike			Argentina	80
Mike			Argentina	80
Steve			Canada	20
Lily	0.15	1.20	Mexico	40
Kate	3.00		Australia	90
Edward	0.05		Australia	70
Pete	0.02	0.03	New Zealand	20

Here's my code:

import pandas as pd
import io
df = pd.read_csv(io.BytesIO(data_to_load['metric data.csv']))

def metricgreaterthanone(x):
    n = pd.to_numeric(x, errors="coerce")
    return (n > 1).sum() / n.notna().sum()

pd.pivot_table(df,index=['Country'],values=["Name","Metric 1","Metric 2","Payment"],aggfunc={'Name':pd.Series.nunique, "Metric 1":metricgreaterthanone,"Metric 2":metricgreaterthanone,"Payment":np.sum},margins=True)

Metric 1 and Metric 2 are unique for each row and must be used in the function without dropping any duplicates. The payment, however, is repeated (that is, John did not receive $300, his total is only $100). If I run the code as is, I get correct output for my pivot table on Metric 1 and Metric 2 but the sum of payments for Canada is $470 whereas it should be $270.

If I apply .drop_duplicates to the pivot table, like so:

pd.pivot_table(df.drop_duplicates(subset=['Name']),index=['Country'],values=["Name","Metric 1","Metric 2","Payment"],aggfunc={'Name':pd.Series.nunique, "Metric 1":metricgreaterthanone,"Metric 2":metricgreaterthanone,"Payment":np.sum},margins=True)

Then the sum of payments is correct for each country but the Metric functions are no longer correct. So, my question is: is there a way to drop the duplicates only as they relate to the Payment column?

Thank you for any and all help! (Bonus points if you can also explain why the margins are not working here either!)

Ben.T Ben.T · Accepted Answer · 2021-04-29T14:06:44

As you don't perform other operation on the column payment, you could mask (hence replace by nan) the values where duplicated in column Name like.

pd.pivot_table(df.assign(Payment=lambda x: x['Payment'].mask(x['Name'].duplicated())),
               index=['Country'],values=["Name","Metric 1","Metric 2","Payment"],
               aggfunc={'Name':pd.Series.nunique, "Metric 1":metricgreaterthanone,
                        "Metric 2":metricgreaterthanone,"Payment":np.sum},
               margins=True)

             Metric 1  Metric 2  Name  Payment
Country                                       
Argentina         NaN       NaN     1     80.0
Australia         0.5       NaN     2    160.0
Canada            0.0  0.333333     3    270.0
Mexico            0.0  1.000000     1     40.0
New Zealand       0.0  0.000000     1     20.0
UK                0.0  0.000000     1     50.0
US                0.0       NaN     1     30.0
All               0.0  0.666667     3    160.0

and for the margins issue, you can see this question

Can you drop duplicates for only one aggfunc in a pandas pivot table?

1 Answers