0
votes

I have 2 dataframes df1 & df2. The df2 dataframes is a subset of df1 I extracted to do some cleaning. Both dataframes can be matched on index. I seen a lot of merges on the site. I don't want to add more columns to df1 and the dataframes are not the same size df1 has 1000 rows and df2 has 275 rows so i don't want to replace the entire column. I want to update df1['AgeBin'] with the df2['AgeBin'] values where the indexes for these dataframes have a match.

indexes = df.loc[df.AgeBin.isin(dfage_test.AgeBin.values)].index
df1.at[indexes,'AgeBin'] = df2['AgeBin'].values

This was what I came up with but seems like there is an issue since df's are different sizes

ValueError: Must have equal len keys and value when setting with an iterable

Below is an over simplification. df1 has 26 columns and df2 has 12 columns the Agebin is the last column in both dfs. This is in theory what my objective is

df2
    AgeBin
0     2 
1     3 
2     1 
3     3 


df1
    AgeBin
0     NaN 
1     NaN 
2     NaN 
3     NaN 

df1 after update
    AgeBin
0     2 
1     3 
2     1 
3     3 

Here are dataframe specs

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   PassengerId       1046 non-null   float64 
 1   Survived          714 non-null    category
 2   Pclass            1046 non-null   category
 3   Name              1046 non-null   object  
 4   Sex               1046 non-null   object  
 5   Age               1046 non-null   float64 
 6   SibSp             1046 non-null   float64 
 7   Parch             1046 non-null   float64 
 8   Ticket            1046 non-null   object  
 9   Fare              1046 non-null   float64 
 10  Embarked          1046 non-null   category
 11  FamilySize        1046 non-null   float64 
 12  Surname           1046 non-null   object  
 13  Title             1046 non-null   object  
 14  IsChild           1046 non-null   float64 
 15  isMale            1046 non-null   category
 16  GroupID           1046 non-null   float64 
 17  GroupSize         1046 non-null   float64 
 18  GroupType         1046 non-null   object  
 19  GroupNumSurvived  1046 non-null   float64 
 20  GroupNumPerished  1046 non-null   float64 
 21  LargeGroup        1046 non-null   float64 
 22  SplitFare         1046 non-null   float64 
 23  log10Fare         1046 non-null   float64 
 24  log10SplitFare    1046 non-null   float64 
 25  AgeBin            1046 non-null   category
dtypes: category(5), float64(15), object(6)
memory usage: 221.9+ KB
  

dfageResults.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 263 entries, 5 to 1308
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   AgeBin  263 non-null    category
dtypes: category(1)
memory usage: 12.4 KB

Here are the categories

67] dfageResults.groupby(["AgeBin"])["AgeBin"].count()
AgeBin
0-14      25
15-29    192
30-44     46
Name: AgeBin, dtype: int64

[68] df.groupby(["AgeBin"])["AgeBin"].count()
AgeBin
0-14     107
15-29    462
30-44    301
45-59    136
60+       40
Name: AgeBin, dtype: int64
2
Could you please provide sample data of the two dataframes?Christopher
What you need is df1.combine_first(df2)moys
I tried your solution and it did not work. Thanks tho for helping!Leo Torres

2 Answers

1
votes

Assuming all the indexed in df2 exist in df1 (which I understand is the case) - the below will suffice:

df1.loc[df2.index,:]=df2

In case if the above assumption for index won't hold - this is the alternative (same result - updates only existing indexes in df1):

df1.loc[set(df2.index).intersection(set(df1.index)),:]=df2

Sample output (with more representative sample data):

import pandas as pd
import numpy as np

df1=pd.DataFrame({"AgeBin": [1,2,3,'x', np.nan,np.nan,'a']})

df2=pd.DataFrame({"AgeBin": ['new1', 'new2', 123]}, index=[5,2,3])

print(df1)
print(df2)
df1.loc[df2.index,:]=df2
print(df1)

Outputs:

  AgeBin
0      1
1      2
2      3
3      x
4    NaN
5    NaN
6      a

  AgeBin
5   new1
2   new2
3    123

  AgeBin
0      1
1      2
2   new2
3    123
4    NaN
5   new1
6      a
0
votes

Try:

print('df2')
print(df2)

print('\ndf1')
print(df1)

df1.update(df2)

print('\ndf1 after update')
print(df1)

Output:

df2
  AgeBin
0  2    
1  3    
2  1    
3  3    

df1
   AgeBin
0 NaN    
1 NaN    
2 NaN    
3 NaN    

df1 after update
  AgeBin
0  2    
1  3    
2  1    
3  3