Short Question Within Pandas, what's the most convenient way to merge two dataframes, such that all entries in the left dataframe receive the first matching value from the right dataframe?
Longer Question
Say I have two spreadsheets: people.csv and orders.csv. people.csv contains several columns of information on the person, whereas orders.csv contains the person's full name, and a row of the # of orders that person placed.
I need to create a third csv, output.csv which contains all of the columns from people.csv plus a column from output.csv matched on one of the columns within both spreadsheets (called "FULL_NAME" in one, and "CUSTOMER_FULL_NAME" in the other)
people.csv is sorted on the FULL_NAME field, but contains duplicate rows, such that there are multiple rows with "John Smith" in the FULL_NAME column. There are also duplicate rows within orders.csv but not the same number of duplicates (for example, people.csv may have 4 John Smith entries, but orders.csv may only have two).
If I use the following code:
people = pd.read_csv('people.csv')
orders = pd.read_csv('orders.csv')
full = pd.merge(
people,
orders,
left_on='FULL_NAME',
right_on='CUSTOMER_FULL_NAME',
)
result.to_csv("output.csv")
... I get a CSV where only two of the rows with "John Smith" in the FULL_NAME field has John Smith's number of orders. The rows directly below have no value in that field. That's because output.csv only contained two rows with matching values for John Smith, whereas people.csv had 4.
Is there a convenient method within Pandas to set the value of one column to the first matching column in the other dataframe, such that all 4 entries contain the first matching value from orders.csv?
EDIT Full current version of my script, returning CSVs containing rows not set with the expected values:
import pandas as pd
community = pd.read_csv("orders.csv")
full = pd.read_csv("people.csv")
result = pd.merge(
full,
community.drop_duplicates(subset=['FULL_NAME'], keep='first'),
left_on="CUSTOMER_FULL_NAME",
right_on="FULL_NAME",
how='left',
)
result.to_csv("output.csv")
So I think I'm missing something else here, because some of the rows are matching in the expected way. Here's an example from the ouput file:
ID FULL_NAME EMPLOYER DIVISION ORDER #
7350 John Smith RiteAid Clinical Research 25
7351 John Smith RiteAid Clinical Research 25
7352 John Smith Costco Sales
7353 John Smith Costco Sales
This John Smith rows doesn't have a duplicate value within the orders.csv file so I do think this is working since two of the rows got it. However, I didn't get a match on the John Smith rows that list Costco rather than RiteAid (or other different fields). This surprises me since I thought the index check was only on the FULL_NAME field.
Any ideas on why the other rows might not be filled in?