4
votes

I have a set of records in Python with an id, at least one attribute, and a set of date ranges. I would like code that takes each id, and combines all the records where the attributes match and there is no gap in the date range.

By no gap in date range, I mean that the end date of one record is greater than or equal to the next record for that id.

For instance, a record with id "10", start date "2016-01-01" and end date "2017-01-01" could be merged with another record with that id, a start date of "2017-01-01", and an end date of "2018-01-01", but it could NOT be merged with a record that started on "2017-01-10", because there'd be a gap from 2017-01-01 to 2017-01-09.

Here are some examples --

Have:

FruitID,FruitType,StartDate,EndDate
1,Apple,2015-01-01,2016-01-01
1,Apple,2016-01-01,2017-01-01
1,Apple,2017-01-01,2018-01-01
2,Orange,2015-01-01,2016-01-01
2,Orange,2016-05-31,2017-01-01
2,Orange,2017-01-01,2018-01-01
3,Banana,2015-01-01,2016-01-01
3,Banana,2016-01-01,2017-01-01
3,Blueberry,2017-01-01,2018-01-01
4,Mango,2015-01-01,2016-01-01
4,Kiwi,2016-09-15,2017-01-01
4,Mango,2017-01-01,2018-01-01

Want:

FruitID,FruitType,NewStartDate,NewEndDate
1,Apple,2015-01-01,2018-01-01
2,Orange,2015-01-01,2016-01-01
2,Orange,2016-05-31,2018-01-01
3,Banana,2015-01-01,2017-01-01
3,Blueberry,2017-01-01,2018-01-01
4,Mango,2015-01-01,2016-01-01
4,Kiwi,2016-09-15,2017-01-01
4,Mango,2017-01-01,2018-01-01

My current solution is below. It provides the results I'm looking for, but performance doesn't seem great for large datasets. Also, my impression is that you generally want to avoid iterating over individual rows of a dataframe when possible. Thank you very much for any assistance you can provide!

import pandas as pd
from dateutil.parser import parse

have = pd.DataFrame.from_items([('FruitID', [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]),
                                ('FruitType', ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana', 'Blueberry', 'Mango', 'Kiwi', 'Mango']),
                                ('StartDate', [parse(x) for x in ['2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-05-31',
                                                                  '2017-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-09-15', '2017-01-01']]),
                                ('EndDate', [parse(x) for x in ['2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01',
                                                                '2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01']])
                                ])

have.sort_values(['FruitID', 'StartDate'])

rowlist = []
fruit_cur_row = None

for row in have.itertuples():
    if fruit_cur_row is None:
        fruit_cur_row = row._asdict()
        fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)

    elif not(fruit_cur_row.get('FruitType') == row.FruitType):
        rowlist.append(fruit_cur_row)

        fruit_cur_row = row._asdict()
        fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)

    elif (row.StartDate <= fruit_cur_row.get('NewEndDate')):
        fruit_cur_row['NewEndDate'] = max(fruit_cur_row['NewEndDate'], row.EndDate)
    else:
        rowlist.append(fruit_cur_row)
        fruit_cur_row = row._asdict()
        fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)

rowlist.append(fruit_cur_row)
have_mrg = pd.DataFrame.from_dict(rowlist)
print(have_mrg[['FruitID', 'FruitType', 'NewStartDate', 'NewEndDate']])
3
Could you explain what "no gap in the date range" means? I am having trouble understanding the question. Thanks.Riley Hun
I've updated my post to include more detail on "no gap" to try and clarify this.Netbrian

3 Answers

5
votes

Using a nested groupby approach:

def merge_dates(grp):
    # Find contiguous date groups, and get the first/last start/end date for each group.
    dt_groups = (grp['StartDate'] != grp['EndDate'].shift()).cumsum()
    return grp.groupby(dt_groups).agg({'StartDate': 'first', 'EndDate': 'last'})

# Perform a groupby and apply the merge_dates function, followed by formatting.
df = df.groupby(['FruitID', 'FruitType']).apply(merge_dates)
df = df.reset_index().drop('level_2', axis=1)

Note that this method assumes your dates are already sorted. If not, you'll need to use sort_values on your DataFrame first. This method may not work if you have nested date spans.

The resulting output:

   FruitID  FruitType   StartDate     EndDate
0        1      Apple  2015-01-01  2018-01-01
1        2     Orange  2015-01-01  2016-01-01
2        2     Orange  2016-05-31  2018-01-01
3        3     Banana  2015-01-01  2017-01-01
4        3  Blueberry  2017-01-01  2018-01-01
5        4       Kiwi  2016-09-15  2017-01-01
6        4      Mango  2015-01-01  2016-01-01
7        4      Mango  2017-01-01  2018-01-01
2
votes

Great answer root. I have modified your function, so that now it works also when date ranges are intersected. Maybe it will help someone.

def merge_dates(grp):
    dt_groups = (grp['StartDate'] > grp['EndDate'].shift()).cumsum()
    grouped = grp.groupby(dt_groups).agg({'StartDate': 'min', 'EndDate': 'max'})
    if len(grp) == len(grouped):
        return grouped
    else:
        return merge_dates(grouped)
0
votes

Here is what I came up with...

df = pd.melt(data, id_vars=['FruitID', 'FruitType'], var_name='WhichDate', value_name='Date')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['FruitType', 'Date']).drop_duplicates(['FruitType', 'Date'])
df = df.assign(Counter = np.nan)
StartDf = df[df['WhichDate']=='StartDate']
StartDf = StartDf.assign(Counter=np.arange(len(StartDf)))
df[df['WhichDate']=='StartDate'] = StartDf
df.fillna(method='ffill', inplace=True)
s = df.groupby(['Counter', 'FruitID', 'FruitType']).agg({'Date': [min, max]}).rename(columns={'min': 'NewStartDate', 'max': 'NewEndDate'})
s.columns = s.columns.droplevel()
s = s.reset_index()
del s['Counter']
s = s.sort_values(['FruitID', 'FruitType']).reset_index(drop=True)

Which outputs...

   FruitID  FruitType NewStartDate NewEndDate
0        1      Apple   2015-01-01 2018-01-01
1        2     Orange   2015-01-01 2016-01-01
2        2     Orange   2016-05-31 2018-01-01
3        3     Banana   2015-01-01 2017-01-01
4        3  Blueberry   2017-01-01 2018-01-01
5        4       Kiwi   2016-09-15 2017-01-01
6        4      Mango   2015-01-01 2016-01-01
7        4      Mango   2017-01-01 2018-01-01

Explanation

First, I recreated your dataframe.

data = pd.DataFrame({'FruitID' : [1,1,1,2,2,2,3,3,3,4,4,4],
                     'FruitType': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana',
                                   'Blueberry', 'Mango', 'Kiwi',
                                   'Mango'],
            'StartDate': ['2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-05-31',
                          '2017-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01',
                          '2016-09-15', '2017-01-01'],
            'EndDate' : ['2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01',
                         '2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01',
                         '2018-01-01']})

Next, I used pandas melt function to reshape the data to long format.

df = pd.melt(data, id_vars=['FruitID', 'FruitType'], var_name='WhichDate', value_name='Date')

Then, I sort by Date for each Fruit Type and remove any rows with duplicate Dates

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['FruitType', 'Date']).drop_duplicates(['FruitType', 'Date'])

I create a helper column used to mark each row with a StartDate. We will need to do this before doing a groupby. Then use fillna to help partition the groups.

df = df.assign(Counter = np.nan)
StartDf = df[df['WhichDate']=='StartDate']
StartDf = StartDf.assign(Counter=np.arange(len(StartDf)))
df[df['WhichDate']=='StartDate'] = StartDf
df.fillna(method='ffill', inplace=True)

Finally, we use groupby and agg to get the min and max dates for each partition.

s = df.groupby(['Counter', 'FruitID', 'FruitType']).agg({'Date': [min, max]}).rename(columns={'min': 'NewStartDate', 'max': 'NewEndDate'})
s.columns = s.columns.droplevel()
s = s.reset_index()
del s['Counter']
s = s.sort_values(['FruitID', 'FruitType']).reset_index(drop=True)