I have a data frame like this,

col1    col2    col3
 1        2       3
 2        3       4
 4        2       3
 7        2       8
 8        3       4
 9        3       3
 15       1       12

Now I want to group consecutive rows where the difference between two consecutive col1 values is less than 3, sum the other columns within each group, and create another column (col4) holding the last col1 value of the group. So the final data frame will look like,

col1    col2    col3    col4
  1       7       10     4
  7       8       15     9

Using a for loop to do this is tedious, so I am looking for a pandas shortcut that does it efficiently.

Shouldn't you have a 3rd row in the output since the diff between the last two rows is also greater than 3 (6)? - cs95
@cs95 Yes, the third row will be there - Kallol

1 Answer

You can do a named aggregation on a groupby whose key increases by one each time col1 jumps by 3 or more from the previous row:

(df.groupby(df.col1.diff().ge(3).cumsum(), as_index=False)
   .agg(col1=('col1','first'),
        col2=('col2','sum'),
        col3=('col3','sum'),
        col4=('col1','last'))
)

Output:

   col1  col2  col3  col4
0     1     7    10     4
1     7     8    15     9
2    15     1    12    15
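
To see why that grouping key works, it may help to look at the intermediate steps on the sample data from the question. This is only a sketch: the frame is rebuilt by hand, and the names gap, starts_new_group and key are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 2, 4, 7, 8, 9, 15],
    "col2": [2, 3, 2, 2, 3, 3, 1],
    "col3": [3, 4, 3, 8, 4, 3, 12],
})

# Difference to the previous row; the first row has no predecessor, so it is NaN
gap = df["col1"].diff()

# True where a new group should start; the NaN in the first row compares as False
starts_new_group = gap.ge(3)

# Running count of group starts, which gives each row its group label
key = starts_new_group.cumsum()

print(key.tolist())  # [0, 0, 0, 1, 1, 1, 2]
```

Rows sharing a label are consecutive rows whose col1 values are less than 3 apart, which is the grouping the question asks for.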

Update: without named aggregation, you can do something like this:

groups = df.groupby(df.col1.diff().ge(3).cumsum())
new_df = groups.agg({'col1':'first', 'col2':'sum','col3':'sum'})
new_df['col4'] = groups['col1'].last()
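
Put together end to end, the second variant might look like the sketch below. The helper name collapse_runs and its threshold parameter are hypothetical, not part of pandas, and reset_index(drop=True) is an addition here so that the group label does not stay behind as the index.

```python
import pandas as pd

def collapse_runs(df, threshold=3):
    """Hypothetical helper: merge consecutive rows whose col1 gap is below threshold."""
    # Group label that increases whenever col1 rises by `threshold` or more
    key = df["col1"].diff().ge(threshold).cumsum()
    groups = df.groupby(key)
    # First col1 of each group, plus the sums of the other columns
    out = groups.agg({"col1": "first", "col2": "sum", "col3": "sum"})
    # Last col1 of each group; aligned with `out` on the group label
    out["col4"] = groups["col1"].last()
    # Drop the group label so the result has a plain 0..n-1 index
    return out.reset_index(drop=True)

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 2, 4, 7, 8, 9, 15],
    "col2": [2, 3, 2, 2, 3, 3, 1],
    "col3": [3, 4, 3, 8, 4, 3, 12],
})

result = collapse_runs(df)
print(result)
```

On the sample data this should give the same three rows as the named aggregation above.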