0
votes

I have some data in a pandas dataframe, where the date column is the index, as follows:

date        value A       value B      country        category
ddmmyy      A1            B1           US             Red
ddmmyy      A2            B2           US             Red
ddmmyy      A3            B3           UK             Green
ddmmyy      A4            B4           UK             Green
ddmmyy      A5            B5           US             Red
ddmmyy      A6            B6           US             Red

There are multiple countries (other than UK, US) and multiple categories (other than red and green). The dates represent transactions and occur at an irregular frequency. I want to regroup the data as follows:

date        UK           US    
mmyy       num11        num21
mmyy       num12        num22
mmyy       num13        num23
mmyy       num14        num24

Where:

1. The dates are regular, consecutive monthly dates.
2. num11 for the UK is the weighted average of the A values, weighted by the B values, i.e. the weighted average for a given month = sum of {An x Bn} over that month, divided by the sum of Bn over that month. And likewise for the US.
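For example, if the UK had two transactions in a given month with (A1, B1) = (10, 2) and (A2, B2) = (20, 3), that month's value would be (10 x 2 + 20 x 3) / (2 + 3) = 80 / 5 = 16.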

I have tried using e.g.

dataframe['value A'].resample('M').sum()

This resamples column A into monthly buckets by summing the values (A1, A2, etc.) in each month, but it doesn't give me the weighted average per country that I want.

What's the best way to do this in pandas/Python?

Thanks for the question, hopefully I have clarified above - let me know if not. – user7289
I have updated my answer with the new information. Is the category still ignored? – Graipher
Yes, you're right, category is ignored, and I intend to treat it the same way as country - I will try your answer out and come back. – user7289

2 Answers

2
votes

Here is a groupby and apply approach:

import pandas as pd
import numpy as np

def weighted_average(group):
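    # Average of "value A" weighted by "value B": sum(A*B) / sum(B).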
    return (group["value A"] * group["value B"]).sum() / group["value B"].sum()


df = pd.DataFrame({"value A": np.random.randint(1, 100, 10),
                   "value B": np.random.randint(1, 100, 10),
                   "country": np.random.choice(["US", "UK"], 10),
                   "category": np.random.choice(["Red", "Green"], 10)},
                  index=pd.date_range("2018-01-26", "2018-02-04", periods=10))
print(df)
#            category country  value A  value B
# 2018-01-26    Green      UK       74       93
# 2018-01-27    Green      UK       57        1
# 2018-01-28    Green      US        6       24
# 2018-01-29    Green      UK       31       89
# 2018-01-30    Green      UK       73       75
# 2018-01-31    Green      US       86       63
# 2018-02-01    Green      US       86       30
# 2018-02-02    Green      US       53       37
# 2018-02-03      Red      UK       50       69
# 2018-02-04      Red      US       98       33

print(df.groupby([pd.Grouper(freq='M'), "country"]).apply(weighted_average).unstack())
# country            UK         US
# 2018-01-31  58.810078  63.931034
# 2018-02-28  50.000000  77.750000

Note the pandas.Grouper, which groups by month (by default on the index, but you can also point it at a column with key="date" if you don't want to set the index).
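For example, here is a minimal sketch reusing the df and weighted_average defined above (the reset_index/rename step only simulates having "date" as an ordinary column):

df_flat = df.reset_index().rename(columns={"index": "date"})

print(df_flat.groupby([pd.Grouper(key="date", freq='M'), "country"])
             .apply(weighted_average)
             .unstack())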

If you also want to separate by category at the same time, you can just add it to the groupby call (df.groupby([pd.Grouper(freq='M'), "country", "category"])...). This makes the index one level deeper, so you have to decide whether you want the resulting multi-index on the columns or the rows. If you want it on the columns, just add another call to unstack() at the end.
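A minimal sketch of that variant, again reusing the df and weighted_average from above (the first unstack() moves category into the columns, the second moves country as well):

monthly = (df.groupby([pd.Grouper(freq='M'), "country", "category"])
             .apply(weighted_average)
             .unstack()    # category -> columns
             .unstack())   # country -> columns as well
print(monthly)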

0
votes

I would create a new column mmyy (if the date column is a datetime or a string you might have to use the datetime module), then group by mmyy and country, take the total sums of the product A x B and of B per group, and create a new column which is their ratio - that ratio is the weighted average.
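A minimal sketch of this approach, assuming a frame like the one in the other answer but with "date" as an ordinary column (all column names are illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({"date": pd.date_range("2018-01-26", periods=10),
                   "value A": np.random.randint(1, 100, 10),
                   "value B": np.random.randint(1, 100, 10),
                   "country": np.random.choice(["US", "UK"], 10)})

df["mmyy"] = df["date"].dt.to_period("M")  # month bucket, e.g. 2018-01
df["AxB"] = df["value A"] * df["value B"]  # product to be summed per month

sums = df.groupby(["mmyy", "country"])[["AxB", "value B"]].sum()
sums["weighted avg"] = sums["AxB"] / sums["value B"]
print(sums["weighted avg"].unstack())  # months as rows, countries as columns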