
I am doing a left outer join on two dataframes:

import pandas as pd

def create_joined_dataframe(left, right, join_column):
    joined_df = pd.merge(pd.DataFrame.from_records(left),
                         pd.DataFrame.from_records(right),
                         how='left',
                         left_on=join_column,
                         right_on=join_column
                         ).sort_values(by='block_index')
    # convert NaN values to None
    return joined_df.where(pd.notnull(joined_df), None)

In the above function, 'left' and 'right' are two NumPy structured arrays, and the cardinality of the 'join_column' is mostly one-to-one. Both dataframes contain around 1.3 million rows.
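For reference, here is a minimal sketch of the same left join on toy structured arrays (the column names, dtypes, and values are illustrative stand-ins, not the real data):

```python
import numpy as np
import pandas as pd

# Toy structured arrays standing in for the real 1.3-million-row inputs
left = np.array([(1, 'chr1', 100), (0, 'chr1', 200)],
                dtype=[('block_index', 'i8'), ('chromosome', 'U8'), ('position', 'i8')])
right = np.array([('chr1', 100, 5.0)],
                 dtype=[('chromosome', 'U8'), ('position', 'i8'), ('score', 'f8')])

joined = pd.merge(pd.DataFrame.from_records(left),
                  pd.DataFrame.from_records(right),
                  how='left',
                  on=['chromosome', 'position']).sort_values(by='block_index')
# The left row with no match ('chr1', 200) keeps NaN in the 'score' column
```

Since the join keys share names in this sketch, `on=` replaces the separate `left_on`/`right_on` arguments.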

The 'left' dataframe has a column named 'block_index', by which I sort the joined dataframe.

I then convert the NaN values in the joined dataframe to None.
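As an aside, this conversion forces numeric columns to object dtype, which is a likely source of overhead. A minimal sketch of the effect (the column name is illustrative; `astype(object)` is added here to make the None placement explicit, equivalent in result to the function's last line):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [5.0, np.nan]})

# Replacing NaN with None cannot be represented in a float64 column,
# so the result is a slower object-dtype column
converted = df.astype(object).where(pd.notnull(df), None)
```

Every subsequent operation on `converted['score']` now dispatches on Python objects instead of a contiguous float array.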

The average execution time for this function is 3.22s.

In [19]: %timeit joined_df = create_joined_dataframe(left, right, ('chromosome', 'position'))
1 loop, best of 3: 3.22 s per loop

I want to reduce this execution time considerably (to something like 0.3 s).

I have tried using Cython (without code changes) and it did not produce a significant speed-up.

I am not sure what parts I should refactor to better utilize Cython or if there is some other technique I should be considering. Any suggestions?

Why are you replacing the NaNs with None? Depending on the types, that step may take longer than the merge, and it is likely unnecessary. – chrisb
I realized I was converting NaN values to None just to satisfy my current unit tests. Removing this step shaves a second off the execution time. Thank you for the suggestion. – Shreeraj