2
votes

How can we use Coalesce with multiple data frames.

columns_List = Emp_Id, Emp_Name, Dept_Id...

I have two data frames getting used in python script. df1[Columns_List] , df2[columns_List]. In both the dataframes i have same columns used but i will be having different values in both dataframes.

How can i use Coalesce so that lets say :In Dataframe df1[Columns_List] -- I have Emp_Name null then i want to pick Emp_Name from df2[Columns_list].

I am trying to create an output CSV file.

Please sorry if my framing of question is wrong..

Please find below sample data.

For Dataframe1 -- df1[Columns_List] .. Please find below output

EmpID,Emp_Name,Dept_id,DeptName
1,,1,
2,,2,

For Dataframe2 -- df2[Columns_List] .. Please find below output

EmpID,Emp_Name,Dept_id,DeptName
1,XXXXX,1,Sciece
2,YYYYY,2,Maths

I have source as Json file. Once i parse the data by python , i am using 2 dataframes in the same script. In Data frame 1 ( df1) i have Emp_Name & Dept_Name as null. In that case i want to pick data from Dataframe2 (df2).

In the above example i have provided few columns. But i may have n number of columns. but column ordering and column names will be always same. I am trying to achieve in such a way if any of the column from df1 is null then i want to pick value from df2.

Is that possible.. Please help me with any suggestionn...

3
@sammywemmy -- I have added some sample output data. Please let me know if i need to add any more details - Arya
Do you want to replace values which are Null or which are empty? By null I mean numpy.NaN, by empty could be an empty string. - Valentino
@Valentino .. Thanks a lot for the reply. Please consider it as empty.. - Arya

3 Answers

2
votes

You can use pandas.DataFrame.combine. This method does what you need: it builds a dataframe taking elements from two dataframes according to a custom function.
You can then write a custom function which picks the element from dataframe one unless that is null, in which case the element is taken from dataframe two.

Consider the two following dataframe. I built them according to your examples but with a small difference to emphatize that only emtpy string will be replaced:

columnlist = ["EmpID", "Emp_Name", "Dept_id", "DeptName"]

df1 = pd.DataFrame([[1, None, 1, np.NaN], [2, np.NaN, 2, None]], columns=columnlist)
df2 = pd.DataFrame([[1, "XXX", 2, "Science"], [2, "YYY", 3, "Math"]], columns=columnlist)

They are:

df1
   EmpID  Emp_Name  Dept_id  DeptName
0      1       NaN        1       NaN
1      2       NaN        2       NaN

df2
   EmpID Emp_Name  Dept_id DeptName
0      1      XXX        1  Science
1      2      YYY        3     Math

What you need to do is:

ddf = df1.combine(df2, lambda ss, rep_ss : pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))

to get ddf:

ddf
   EmpID Emp_Name  Dept_id DeptName
0      1      XXX        1  Science
1      2      YYY        2     Math

As you can see, only Null values in df1 have been replaced with the corresponding values in df2.

EDIT: A bit deeper explanation

Since I've been asked in the comments, let me give a bit of explanation more on the solution:

ddf = df1.combine(df2, lambda ss, rep_ss : pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))

Is a bit compact, but there is nothing much than some basic python techiques like list comprehensions, plus the use of pandas.DataFrame.combine. The pandas method is detailed in the docs I linked above. It compares the two dataframes column by column: the columns are passed to a custom function which must return a pandas.Series. This Series become a column in the returned dataframe.
In this case, the custom function is a lambda, which uses a list comprehension to loop over the pairs of elements (one from each column) and pick only one element of the pair (the first if not null, otherwise the second).

2
votes

You can use a mask to get null values and replace those. The best part is that you don't have to eyeball anything; the function will find what to replace for you.

You can also adjust the pd.DataFrame.select_dtypes() function to suit your needs, or just go through multiple dtypes with appropriate conversion and detection measures being used.

import pandas as pd

ddict1 = {
    'EmpID':[1,2],
    'Emp_Name':['',''],
    'Dept_id':[1,2],
    'DeptName':['',''],
}

ddict2 = {
    'EmpID':[1,2],
    'Emp_Name':['XXXXX','YYYYY'],
    'Dept_id':[1,2],
    'DeptName':['Sciece','Maths'],
}

df1 = pd.DataFrame(ddict1)
df2 = pd.DataFrame(ddict2)


def replace_df_values(df_A, df_B):
    ## Select object dtypes
    for i in df_A.select_dtypes(include=['object']):
        ### Check to see if column contains missing value
        if len(df_A[df_A[i].str.contains('')]) > 0:
            ### Create mask for zero-length values (or null, your choice)
            mask = df_A[i] == ''
            ### Replace on 1-for-1 basis using .loc[]
            df_A.loc[mask, i] = df_B.loc[mask, i]

### Pass dataframes in reverse order to cover both scenarios
replace_df_values(df1, df2)
replace_df_values(df2, df1)

Initial values for df1:

   EmpID Emp_Name  Dept_id DeptName
0      1                 1         
1      2                 2         

Output for df1 after running function:

   EmpID Emp_Name  Dept_id DeptName
0      1    XXXXX        1   Sciece
1      2    YYYYY        2    Maths
1
votes

I replicated your dataframes:

# df1
    EmpID   Emp_Name    Dept_id DeptName
0   1       1   
1   2       2   

# df2
    EmpID   Emp_Name    Dept_id DeptName
0   1   XXXXX   1   Sciece
1   2   YYYYY   2   Maths
  1. If you want to replace missing values (NaN) from df1.column with existing values from df2.column, you could use .fillna(). For example:
df1['Emp_Name'].fillna(df2['Emp_Name'], inplace=True)

# df1

    EmpID   Emp_Name    Dept_id DeptName
0   1   XXXXX   1   
1   2   YYYYY   2   

  1. If you want to replace all values from a given column with the values from the same column of another dataframe, you could use list comprehension.
df1['DeptName'] = [ each for each in list(df2['DeptName'])]

    EmpID   Emp_Name    Dept_id DeptName
0   1   XXXXX   1   Sciece
1   2   YYYYY   2   Maths

I'm sure there's a better way to do this, but I hope this helps!