I am doing a left outer join on two dataframes:
import pandas as pd

def create_joined_dataframe(left, right, join_column):
    joined_df = pd.merge(pd.DataFrame.from_records(left),
                         pd.DataFrame.from_records(right),
                         how='left',
                         left_on=join_column,
                         right_on='join_col'
                         ).sort_values(by='block_index')
    # convert the NaN values introduced by the left join to None
    return joined_df.where(pd.notnull(joined_df), None)
In the above function, 'left' and 'right' are two NumPy structured arrays, and the cardinality of the 'join_column' is mostly one-to-one. Both dataframes contain around 1.3 million rows.
The 'left' dataframe has a column named 'block_index', which I use to sort the joined dataframe.
I then convert the NaN values in the joined dataframe to None values.
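For context, here is the same join-sort-convert pipeline on a small pair of synthetic structured arrays (a minimal sketch: the 'key' and 'value' field names are made up for illustration; only 'block_index' and 'join_col' come from the function above):

    import numpy as np
    import pandas as pd

    # Tiny stand-ins for the real 1.3M-row structured arrays
    left = np.array([(0, 'a'), (1, 'b'), (2, 'c')],
                    dtype=[('block_index', 'i8'), ('key', 'U1')])
    right = np.array([('a', 10.0), ('c', 30.0)],
                     dtype=[('join_col', 'U1'), ('value', 'f8')])

    merged = pd.merge(pd.DataFrame.from_records(left),
                      pd.DataFrame.from_records(right),
                      how='left', left_on='key', right_on='join_col')
    merged = merged.sort_values(by='block_index')

    # Rows with no match on the right carry missing values after the left join;
    # where/notnull maps them to None.
    result = merged.where(pd.notnull(merged), None)

The unmatched row ('b') has a missing 'value' and 'join_col' after the left join, which the final step replaces.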
The average execution time for this function is 3.22s.
In [19]: %timeit joined_df = create_joined_dataframe(left, right, ('chromosome', 'position'))
1 loop, best of 3: 3.22 s per loop
I want to reduce this execution time considerably (to something like 0.3 s).
I have tried using Cython (without code changes) and it did not produce a significant speed-up.
I am not sure what parts I should refactor to better utilize Cython or if there is some other technique I should be considering. Any suggestions?
Is replacing the NaNs with None necessary? Depending on the types, that step may take longer than the merge, and is likely unnecessary. – chrisb
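One way to check that point is to time the NaN-to-None conversion in isolation (a sketch with synthetic data; the row count and null pattern are arbitrary):

    import time
    import numpy as np
    import pandas as pd

    n = 100_000
    # Every third row is missing, mimicking unmatched rows after a left join
    df = pd.DataFrame({'block_index': np.arange(n),
                       'value': np.where(np.arange(n) % 3 == 0, np.nan, 1.0)})

    t0 = time.perf_counter()
    converted = df.where(pd.notnull(df), None)  # the NaN -> None step alone
    elapsed = time.perf_counter() - t0

Comparing `elapsed` against a timing of the merge alone would show whether the conversion (which upcasts columns to object dtype) dominates the 3.22 s.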