3
votes

Short Question Within Pandas, what's the most convenient way to merge two dataframes, such that all entries in the left dataframe receive the first matching value from the right dataframe?

Longer Question Say I have two spreadsheets: people.csv and orders.csv. people.csv contains several columns of information on the person, whereas orders.csv contains the person's full name, and a row of the # of orders that person placed.

I need to create a third csv, output.csv which contains all of the columns from people.csv plus a column from output.csv matched on one of the columns within both spreadsheets (called "FULL_NAME" in one, and "CUSTOMER_FULL_NAME" in the other)

people.csv is sorted on the FULL_NAME field, but contains duplicate rows, such that there are multiple rows with "John Smith" in the FULL_NAME column. There are also duplicate rows within orders.csv but not the same number of duplicates (for example, people.csv may have 4 John Smith entries, but orders.csv may only have two).

If I use the following code:

people = pd.read_csv('people.csv')
orders = pd.read_csv('orders.csv')
full = pd.merge(
    people,
    orders,
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
)
result.to_csv("output.csv")

... I get a CSV where only two of the rows with "John Smith" in the FULL_NAME field has John Smith's number of orders. The rows directly below have no value in that field. That's because output.csv only contained two rows with matching values for John Smith, whereas people.csv had 4.

Is there a convenient method within Pandas to set the value of one column to the first matching column in the other dataframe, such that all 4 entries contain the first matching value from orders.csv?

EDIT Full current version of my script, returning CSVs containing rows not set with the expected values:

import pandas as pd

community = pd.read_csv("orders.csv")
full = pd.read_csv("people.csv")
result = pd.merge(
    full,
    community.drop_duplicates(subset=['FULL_NAME'], keep='first'),
    left_on="CUSTOMER_FULL_NAME",
    right_on="FULL_NAME",
    how='left',
)
result.to_csv("output.csv")

So I think I'm missing something else here, because some of the rows are matching in the expected way. Here's an example from the ouput file:

ID      FULL_NAME   EMPLOYER            DIVISION            ORDER #
7350    John Smith  RiteAid             Clinical Research   25
7351    John Smith  RiteAid             Clinical Research   25
7352    John Smith  Costco              Sales   
7353    John Smith  Costco              Sales   

This John Smith rows doesn't have a duplicate value within the orders.csv file so I do think this is working since two of the rows got it. However, I didn't get a match on the John Smith rows that list Costco rather than RiteAid (or other different fields). This surprises me since I thought the index check was only on the FULL_NAME field.

Any ideas on why the other rows might not be filled in?

1

1 Answers

5
votes

You can use drop_duplicates on subset=['CUSTOMER_FULL_NAME'] in the merge with how='left' to keep all rows from people such as:

full = pd.merge(
    people,
    orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first'), #here the differance
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
    how='left' #and add the how='left'
)

So orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first') will contain only once each name and during the merge, the matching will be with only this unique name