python - Pandas DataFrame aggregate function using multiple columns

86

votes

Is there a way to write an aggregation function as is used in DataFrame.agg method, that would have access to more than one column of the data that is being aggregated? Typical use cases would be weighted average, weighted standard deviation funcs.

I would like to be able to write something like

def wAvg(c, w):
    return ((c * w).sum() / w.sum())

df = DataFrame(....) # df has columns c and w, i want weighted average
                     # of c using w as weight.
df.aggregate ({"c": wAvg}) # and somehow tell it to use w column as weights ...

pythonpandas

Nice article addressing this specific SO question: pbpython.com/weighted-average.html - ptim

105

votes

Yes; use the .apply(...) function, which will be called on each sub-DataFrame. For example:

grouped = df.groupby(keys)

def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

grouped.apply(wavg)

9

votes

It is possible to return any number of aggregated values from a groupby object with apply. Simply, return a Series and the index values will become the new column names.

Let's see a quick example:

df = pd.DataFrame({'group':['a','a','b','b'],
                   'd1':[5,10,100,30],
                   'd2':[7,1,3,20],
                   'weights':[.2,.8, .4, .6]},
                 columns=['group', 'd1', 'd2', 'weights'])
df

  group   d1  d2  weights
0     a    5   7      0.2
1     a   10   1      0.8
2     b  100   3      0.4
3     b   30  20      0.6

Define a custom function that will be passed to apply. It implicitly accepts a DataFrame - meaning the data parameter is a DataFrame. Notice how it uses multiple columns, which is not possible with the agg groupby method:

def weighted_average(data):
    d = {}
    d['d1_wa'] = np.average(data['d1'], weights=data['weights'])
    d['d2_wa'] = np.average(data['d2'], weights=data['weights'])
    return pd.Series(d)

Call the groupby apply method with our custom function:

df.groupby('group').apply(weighted_average)

       d1_wa  d2_wa
group              
a        9.0    2.2
b       58.0   13.2

You can get better performance by precalculating the weighted totals into new DataFrame columns as explained in other answers and avoid using apply altogether.

9

votes

My solution is similar to Nathaniel's solution, only it's for a single column and I don't deep-copy the entire data frame each time, which could be prohibitively slow. The performance gain over the solution groupby(...).apply(...) is about 100x(!)

def weighted_average(df, data_col, weight_col, by_col):
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result

4

votes

The following (based on Wes McKinney' answer) accomplishes exactly what I was looking for. I'd be happy to learn if there's a simpler way of doing this within pandas.

def wavg_func(datacol, weightscol):
    def wavg(group):
        dd = group[datacol]
        ww = group[weightscol] * 1.0
        return (dd * ww).sum() / ww.sum()
    return wavg


def df_wavg(df, groupbycol, weightscol):
    grouped = df.groupby(groupbycol)
    df_ret = grouped.agg({weightscol:sum})
    datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
    for dcol in datacols:
        try:
            wavg_f = wavg_func(dcol, weightscol)
            df_ret[dcol] = grouped.apply(wavg_f)
        except TypeError:  # handle non-numeric columns
            df_ret[dcol] = grouped.agg({dcol:min})
    return df_ret

The function df_wavg() returns a dataframe that's grouped by the "groupby" column, and that returns the sum of the weights for the weights column. Other columns are either the weighted averages or, if non-numeric, the min() function is used for aggregation.

4

votes

I do this a lot and found the following quite handy:

def weighed_average(grp):
    return grp._get_numeric_data().multiply(grp['COUNT'], axis=0).sum()/grp['COUNT'].sum()
df.groupby('SOME_COL').apply(weighed_average)

This will compute the weighted average of all the numerical columns in the df and drop non-numeric ones.

4

votes

Accomplishing this via groupby(...).apply(...) is non-performant. Here's a solution that I use all the time (essentially using kalu's logic).

def grouped_weighted_average(self, values, weights, *groupby_args, **groupby_kwargs):
   """
    :param values: column(s) to take the average of
    :param weights_col: column to weight on
    :param group_args: args to pass into groupby (e.g. the level you want to group on)
    :param group_kwargs: kwargs to pass into groupby
    :return: pandas.Series or pandas.DataFrame
    """

    if isinstance(values, str):
        values = [values]

    ss = []
    for value_col in values:
        df = self.copy()
        prod_name = 'prod_{v}_{w}'.format(v=value_col, w=weights)
        weights_name = 'weights_{w}'.format(w=weights)

        df[prod_name] = df[value_col] * df[weights]
        df[weights_name] = df[weights].where(~df[prod_name].isnull())
        df = df.groupby(*groupby_args, **groupby_kwargs).sum()
        s = df[prod_name] / df[weights_name]
        s.name = value_col
        ss.append(s)
    df = pd.concat(ss, axis=1) if len(ss) > 1 else ss[0]
    return df

pandas.DataFrame.grouped_weighted_average = grouped_weighted_average

1

votes

Here's a solution which has the following benefits:

You don't need to define a function in advance
You can use it within a pipe (since it's using lambda)
You can name the resulting column

:

df.groupby('group')
  .apply(lambda x: pd.Series({
'weighted_average': np.average(x.data, weights = x.weights)})

You can also use the same code to perform multiple aggregations:

df.groupby('group')
  .apply(lambda x: pd.Series({
'weighted_average': np.average(x.data, weights = x.weights), 
'regular_average': np.average(x.data)}))

0

votes

You can implement this function in the following way:

(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()

For example:

df = pd.DataFrame({'groups': [1, 1, 2, 2], 'c': [3, 3, 4, 4], 'w': [5, 5, 6, 6]})
(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()

Result:

groups
1    3.0
2    4.0
dtype: float64

python - Pandas DataFrame aggregate function using multiple columns

8 Answers