Fast way to manipulate multiple large pandas dataframes

Question

I have two very large dataframes, say df1 & df2, each containing over 1M rows and multiple columns. Both df1 and df2 contain an ID column. I need to perform the following operation:

check the first row of df1 and match it with the N-th row of df2 with the same ID.
Take the value of the N-th row of the 3rd column of df2 and place it in a new column in df1.
Repeat 1.&2. for all 1 million rows of df1.

Which way is the fastest to perform this operation? So far what I thought has been to transform both dataframes ID columns into index (with pd.set_index()) and then use the .apply() function to loop over all rows. But it is taking quite a long time to perform the operation (and I will need to perform a more complex procedure later on).

Quang Hoang Quang Hoang · Accepted Answer · 2021-03-08T18:42:37

You can try:

# enumerate df2 within each ID
enums = df2.groupby('ID').cumcount()

# extract the N-th rows in each `ID`
to_map = df2.loc[enums==N-1, ['ID','col3']]

# map to df1
df1['new_col'] = df1['ID'].map(to_map.set_index('ID')['col3'])

Fast way to manipulate multiple large pandas dataframes

1 Answers