1
votes

I am building a pivot table in pandas and need to apply a variety of functions to the data. My problem is, for certain aggfuncs, I need to apply them to the entire dataset, whereas for others, I need to drop duplicates before applying the function. My dataframe looks similar to this:

I'm working with a dataframe similar to this:

Name Metric 1 Metric 2 Country Payment
John 0.10 5.00 Canada 100
John 0.30 1.00 Canada 100
John .40 Canada 100
Jane 0.50 US 30
Jane US 30
Jack UK 50
Jack .70 .2 UK 50
Jack 1.00 UK 50
Jack UK 50
Polly 0.30 Canada 150
Mike Argentina 80
Mike Argentina 80
Steve Canada 20
Lily 0.15 1.20 Mexico 40
Kate 3.00 Australia 90
Edward 0.05 Australia 70
Pete 0.02 0.03 New Zealand 20

Here's my code:

import pandas as pd
import io
df = pd.read_csv(io.BytesIO(data_to_load['metric data.csv']))

def metricgreaterthanone(x):
    n = pd.to_numeric(x, errors="coerce")
    return (n > 1).sum() / n.notna().sum()

pd.pivot_table(df,index=['Country'],values=["Name","Metric 1","Metric 2","Payment"],aggfunc={'Name':pd.Series.nunique, "Metric 1":metricgreaterthanone,"Metric 2":metricgreaterthanone,"Payment":np.sum},margins=True)

Metric 1 and Metric 2 are unique for each row and must be used in the function without dropping any duplicates. The payment, however, is repeated (that is, John did not receive $300, his total is only $100). If I run the code as is, I get correct output for my pivot table on Metric 1 and Metric 2 but the sum of payments for Canada is $470 whereas it should be $270.

If I apply .drop_duplicates to the pivot table, like so:

pd.pivot_table(df.drop_duplicates(subset=['Name']),index=['Country'],values=["Name","Metric 1","Metric 2","Payment"],aggfunc={'Name':pd.Series.nunique, "Metric 1":metricgreaterthanone,"Metric 2":metricgreaterthanone,"Payment":np.sum},margins=True)

Then the sum of payments is correct for each country but the Metric functions are no longer correct. So, my question is: is there a way to drop the duplicates only as they relate to the Payment column?

Thank you for any and all help! (Bonus points if you can also explain why the margins are not working here either!)

1

1 Answers

1
votes

As you don't perform other operation on the column payment, you could mask (hence replace by nan) the values where duplicated in column Name like.

pd.pivot_table(df.assign(Payment=lambda x: x['Payment'].mask(x['Name'].duplicated())),
               index=['Country'],values=["Name","Metric 1","Metric 2","Payment"],
               aggfunc={'Name':pd.Series.nunique, "Metric 1":metricgreaterthanone,
                        "Metric 2":metricgreaterthanone,"Payment":np.sum},
               margins=True)

             Metric 1  Metric 2  Name  Payment
Country                                       
Argentina         NaN       NaN     1     80.0
Australia         0.5       NaN     2    160.0
Canada            0.0  0.333333     3    270.0
Mexico            0.0  1.000000     1     40.0
New Zealand       0.0  0.000000     1     20.0
UK                0.0  0.000000     1     50.0
US                0.0       NaN     1     30.0
All               0.0  0.666667     3    160.0

and for the margins issue, you can see this question