2
votes

I am trying to add pandas dataframes to another dataframe with different lengths such that the values in the result are aligned with both the (time)index and a key value from a column that is present in all dataframes.

Say I want to combine df1,df2 and df3 and merge on index and column 'id':

df1
            id value1
2015-05-01   1     13
2015-05-01   2     14
2015-05-02   1     15
2015-05-02   2     16

df2
            id  value2
2015-05-01   1       4
2015-05-02   1       5

df3
            id  value2
2015-05-01   2       7
2015-05-02   2       8

What I would like is to get a dataframe that looks like

df
            id   value1 value2
2015-05-01   1       13      4
2015-05-01   2       14      7
2015-05-02   1       15      5
2015-05-02   2       16      8

but I struggle with the merge function.

1

1 Answers

1
votes

If your DataFrames look like this:

import datetime as DT
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id':[1,2,1,2], 'value1':[13,14,15,16]}, index=pd.DatetimeIndex(['2015-5-1', '2015-5-1', '2015-5-2', '2015-5-2']))
df2 = pd.DataFrame({'id':[1,1], 'value2':[4,5]}, index=pd.DatetimeIndex(['2015-5-1', '2015-5-2']))
df3 = pd.DataFrame({'id':[2,2], 'value2':[7,8]}, index=pd.DatetimeIndex(['2015-5-1', '2015-5-2']))

you could concatenate all the DataFrames:

df = pd.concat([df1,df2,df3])
#             id  value1  value2
# 2015-05-01   1      13     NaN
# 2015-05-01   2      14     NaN
# 2015-05-02   1      15     NaN
# 2015-05-02   2      16     NaN
# 2015-05-01   1     NaN       4
# 2015-05-02   1     NaN       5
# 2015-05-01   2     NaN       7
# 2015-05-02   2     NaN       8

Since the result is being aligned on both the date and the id, it's natural to set id as an index. Then if we stack the DataFrame we get this Series:

series = df.set_index(['id'], append=True).stack()
#             id        
# 2015-05-01  1   value1    13
#             2   value1    14
# 2015-05-02  1   value1    15
#             2   value1    16
# 2015-05-01  1   value2     4
# 2015-05-02  1   value2     5
# 2015-05-01  2   value2     7
# 2015-05-02  2   value2     8
# dtype: float64

Now if we turn around and unstack the Series, the values are aligned based on the remaining index -- the date and the id:

result = series.unstack()

yields

               value1  value2
           id                
2015-05-01 1       13       4
           2       14       7
2015-05-02 1       15       5
           2       16       8

Note that unstack() requires that the remaining index is unique. That means that there are no duplicate (date, id) entries. If there are duplicate entries, then its not clear what the desired output should be. One way to address the issue would be to group by the date and id and aggregate the values. Another option would be to pick one value and drop the others.