0
votes

I have a data frame that looks as follows:

df = pd.DataFrame({"A":[10,0,30,40,0,60,70,80,90]}, index = pd.date_range(start='1/1/2020', end='1/09/2020'))
df


            A
2020-01-01  10
2020-01-02  0
2020-01-03  30
2020-01-04  40
2020-01-05  0
2020-01-06  60
2020-01-07  70
2020-01-08  80
2020-01-09  90

I want to loop over the data frame to get t0 and t1 which represent the first and last dates, respectively (i.e. index value) for each range of non-zero values of A.

In the table above, I want to get the following values for t0 and t1:

t0 = 2020-01-01 , t1 = 2020-01-01

t0 = 2020-01-03 , t1 = 2020-01-04

t0 = 2020-01-06 , t1 = 2020-01-09

Is there a simple way to do this within Pandas?

2

2 Answers

1
votes

One option would be to create a Boolean Index based on where A=0. Then groupby the cumsum of that Index and aggregate the first and last dates (assumes index is sorted as it is here):

new_df = df.reset_index()
m = new_df['A'].eq(0)
new_df = (
    new_df.groupby(m.cumsum()[~m])
        .agg(t0=('index', 'first'), t1=('index', 'last'))
        .reset_index(drop=True)
)
          t0         t1
0 2020-01-01 2020-01-01
1 2020-01-03 2020-01-04
2 2020-01-06 2020-01-09

An alternative to grab min and max index values instead of first and last:

new_df = df.reset_index()
m = new_df['A'].eq(0)
new_df = (
    new_df.groupby(m.cumsum()[~m])
        .agg(t0=('index', 'min'), t1=('index', 'max'))
        .reset_index(drop=True)
)
          t0         t1
0 2020-01-01 2020-01-01
1 2020-01-03 2020-01-04
2 2020-01-06 2020-01-09

Indexing:

m:

0    False
1     True
2    False
3    False
4     True
5    False
6    False
7    False
8    False
Name: A, dtype: bool

Create Groups Separated by where 0s are:

m.cumsum()
0    0
1    1
2    1
3    1
4    2
5    2
6    2
7    2
8    2
Name: A, dtype: int32

Self filtered to exclude 0 rows:

m.cumsum()[~m]
0    0
2    1
3    1
5    2
6    2
7    2
8    2
Name: A, dtype: int32

Then group these rows together to determine associated dates.

0
votes

You can use groupby and agg to find the first date and the last date for each range at the same time.

Try this:

df = df.reset_index().groupby('A').agg({'first','last'})
df.columns = df.columns.droplevel(0)
print(df)

Output:

        first        last
A                       
0    2020-01-02   2020-01-05
10   2020-01-01   2020-01-01
30   2020-01-03   2020-01-03
40   2020-01-04   2020-01-04
60   2020-01-06   2020-01-06
70   2020-01-07   2020-01-07
80   2020-01-08   2020-01-08
90   2020-01-09   2020-01-09