efficient mapping of very large dataframes

Question

I have two data frames df1 and df2 with the following columns:

A = ['id', 'a_id', 'c', 'd']
B = ['a_id', 'e', 'f']

For a_id in df1 matching a_id in df2, I need to add e from df2 to df1 row. Df1 is around 7 million rows and df2 is around 15k. I tried the code below but it takes too long. I was wondering if there's a better solution that could speed things up a bit and more memory efficient.

def map_df(row):
    for i, r in df2.iterrows():
        if row['a_id'] == r['a_id']:
            return row2['part_mean_correctness']

df1['e'] = df1.apply (lambda row: map_df(row), axis=1)

Remi Remi · Accepted Answer · 2020-12-01T20:12:22

Not sure I completely understood your issue but are you trying to do a merge? If so, then your code should just look like:

import pandas as pd 
df1 = pd.DataFrame({'a_id':[1, 2, 3, 4, 5], 'c': ["foo", "bar", "baz", "qux", "foobar"]})
df2 = pd.DataFrame({'a_id':[3, 4], 'e': ["bar", "baz"]})
df1 = df1.merge(df2, on=["a_id"], how="left")

Then df1 should look like

   a_id       c    e
0     1     foo  NaN
1     2     bar  NaN
2     3     baz  bar
3     4     qux  baz
4     5  foobar  NaN

efficient mapping of very large dataframes

1 Answers