6
votes

Here are how the dataframes columns look like.

df1='device number', 'date', ....<<10 other columns>> 3500 records

df2='device number', 'date', ....<<9 other columns>> 14,000 records

In each data frame, neither 'device number', nor 'date' are unique. However, their combination is unique to identify a row.

I am trying to form a new data frame which matches the rows from df1 and df2 where both device number and date are equal, and have all the columns from these df1 and df2. The pandas command I am trying is

df3=pd.merge(df1, df2, how='inner', on=['device number', 'date'])

However, df3 gives me a dataframe of shape (14,000, 21). The column number makes sense, but how can the inner join has more rows than any of the left dataframes? Does it mean I have a flaw in my understanding of inner join? Also, how can I achieve the result I described?

1
It would be nice if you can post a code that verify your claims -- e.g. combinations are unique. If both data frames contain unique keys, the result would be (3,500, 21). - Sanghack Lee
@SanghackLee Thanks, I am pretty green in python. The uniqueness comes from my knowledge of the data source and domain knowledge. But any idea how to verify it from the code? I thought of the DataFrame.duplicated function, but not sure how to make it work on the combination of two fields. - Della
Maybe ... print(len(set(zip(df1['device number'], df1['date'])))) and for df2. (Other methods can be found in here - Sanghack Lee

1 Answers

3
votes

Only way I can see this happening... particularly with the 14,000 being the same exact number as the number of records in df2 is if the column combination in df2 are not unique.

You can verify that they are not unique with the following (True if unique)

df2.duplicated(['device number', 'date']).sum() == 0

Or

df.set_index(['device number', 'date']).index.is_unique