pandas aggregate data from two data frames

Question

I have two pandas data frames, with some indexes and some column names in common (like partially overlapping time-series related to common quantities).

I need to merge these two dataframes in a single one containing all the indexes and all the values for each index, keeping the values of the left (right) one in case an index-column combination appears in both data frames.

Both merge and join methods are unhelpful as the merge method will duplicate information I don't need and join causes the same problem.

What's an efficient method to obtain the result I need?

EDIT: If for example I have the two data frames

df1 = pd.DataFrame({
'C1' : [1.1, 1.2, 1.3],
'C2' : [2.1, 2.2, 2.3],
'C3': [3.1, 3.2, 3.3]},
index=['a', 'b', 'c'])

df2 = pd.DataFrame({
'C3' : [3.1, 3.2, 33.3],
'C4' : [4.1, 4.2, 4.3]},
index=['b', 'c', 'd'])

What I need is a method that allows me to create:

merged = pd.DataFrame({
'C1': [1.1, 1.2, 1.3, 'nan'],
'C2': [2.1, 2.2, 2.3, 'nan'],
'C3': [3.1, 3.2, 3.3, 33.3], 
'C4': ['nan', 4.1, 4.2, 4.3]},
index=['a', 'b', 'c', 'd'])

unutbu unutbu · Accepted Answer · 2018-06-25T13:45:03

Here are three possibilities:

Use concat/groupby: First concatenate both DataFrames vertically. Then group by the index and select the first row in each group.
Use combine_first: Make a new index which is the union of df1 and df2. Reindex df1 using the new index. Then use combine_first to fill in NaNs with values from df2.
Use manual construction: We could use df2.index.difference(df1.index) to find exactly which rows need to be added to df1. So we could manually select those rows from df2 and concatenate them on to df1.

For small DataFrames, using_concat is faster. For larger DataFrames, using_combine_first appears to be slightly faster than the other options:

import numpy as np
import pandas as pd
import perfplot

def make_dfs(N):
    df1 = pd.DataFrame(np.random.randint(10, size=(N,2)))
    df2 = pd.DataFrame(np.random.randint(10, size=(N,2)), index=range(N//2,N//2 + N))
    return df1, df2

def using_concat(dfs):
    df1, df2 = dfs
    result = pd.concat([df1,df2], sort=False)
    n = result.index.nlevels
    return result.groupby(level=range(n)).first()

def using_combine_first(dfs):
    df1, df2 = dfs
    index = df1.index.union(df2.index)
    result = df1.reindex(index)
    result = result.combine_first(df2)
    return result

def using_manual_construction(dfs):
    df1, df2 = dfs
    index = df2.index.difference(df1.index)
    cols = df2.columns.difference(df1.columns)
    result = pd.concat([df1, df2.loc[index]], sort=False)
    result.loc[df2.index, cols] = df2
    return result

perfplot.show(
    setup=make_dfs,
    kernels=[using_concat, using_combine_first, 
             using_manual_construction],
    n_range=[2**k for k in range(5,21)],
    logx=True,
    logy=True,
    xlabel='len(df)')

pandas aggregate data from two data frames

2 Answers