Group rows in a pandas data frame when the difference between consecutive rows is less than a value

Question

I have a data frame like this,

col1    col2    col3
 1        2       3
 2        3       4
 4        2       3
 7        2       8
 8        3       4
 9        3       3
 15       1       12

Now I want to group the rows where the difference between two consecutive col1 values is less than 3, sum the other column values, and create another column (col4) holding the last col1 value of each group. The final data frame should look like this:

col1    col2    col3    col4
  1       7       10     4
  7       8       15     9

Using a for loop for this is tedious, so I am looking for a pandas shortcut that does it efficiently.

Shouldn't you have a 3rd row in the output since the diff between the last two rows is also greater than 3 (6)? — cs95

Quang Hoang · Accepted Answer · 2020-06-17T19:28:14

You can do a named aggregation on groupby:

(df.groupby(df.col1.diff().ge(3).cumsum(), as_index=False)
   .agg(col1=('col1','first'),
        col2=('col2','sum'),
        col3=('col3','sum'),
        col4=('col1','last'))
)

Output:

   col1  col2  col3  col4
0     1     7    10     4
1     7     8    15     9
2    15     1    12    15
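
To see why this grouping key works, it may help to look at the intermediate steps. The sketch below rebuilds the sample frame from the question and computes each stage separately; the variable names `gap`, `starts_new_group`, and `group_id` are illustrative and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": [1, 2, 4, 7, 8, 9, 15],
    "col2": [2, 3, 2, 2, 3, 3, 1],
    "col3": [3, 4, 3, 8, 4, 3, 12],
})

# Difference between each col1 value and the previous one (NaN for the first row).
gap = df["col1"].diff()

# True wherever the gap is 3 or more, i.e. where a new group begins.
# The NaN in the first row compares as False, so row 0 stays in the first group.
starts_new_group = gap.ge(3)

# A running count of the True values gives every row an integer group label.
group_id = starts_new_group.cumsum()

print(pd.DataFrame({"col1": df["col1"], "gap": gap, "group_id": group_id}))
```

Rows that share a `group_id` are aggregated together. The last row (col1 = 15) is 6 away from its predecessor, so it gets its own label, which is why the output has a third row.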

Update: without named aggregation, you can do something like this:

groups = df.groupby(df.col1.diff().ge(3).cumsum())
new_df = groups.agg({'col1':'first', 'col2':'sum','col3':'sum'})
new_df['col4'] = groups['col1'].last()
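
One detail of this variant is that the group labels end up as the index of `new_df`, and that index inherits the name `col1` from the grouping Series. A minimal runnable sketch, assuming the sample data from the question, drops that index at the end so the layout matches the named-aggregation output; the `key` variable is introduced here only for readability.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": [1, 2, 4, 7, 8, 9, 15],
    "col2": [2, 3, 2, 2, 3, 3, 1],
    "col3": [3, 4, 3, 8, 4, 3, 12],
})

# Integer group label per row: a new group starts when the gap is 3 or more.
key = df["col1"].diff().ge(3).cumsum()

groups = df.groupby(key)
new_df = groups.agg({"col1": "first", "col2": "sum", "col3": "sum"})

# Both objects are indexed by the group label, so the assignment aligns on it.
new_df["col4"] = groups["col1"].last()

# The index is named 'col1' like the column, so discard it rather than
# turning it into a column.
new_df = new_df.reset_index(drop=True)
print(new_df)
```

Using `drop=True` matters here: a plain `reset_index()` would try to insert the index as a second `col1` column and fail because that column already exists.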
