0
votes

I am trying to compare the values of two columns by calculating the difference; however, I want to create a new column having the value based on a condition.

Suppose I have data like this,

Column1 | Column2 | 
   10   |    9    |
   20   |    40   |
   30   |    30   |

I want to create a data like,

Column1 | Column2 | Column3
   10   |    9    |    9
   20   |    40   |   20
   30   |    30   |   30

The first row of Column3 is 9 because the Column2 - Column1 < 0.0 ,therefore, Column3 will have the same value of Column2. Second row is 20 because the difference is positive (Column2 - Column1 > 0.0),therefore, Column3 will have the same value of Column1. Third row, the difference is 0 so Column3 can either value of Column1 or Column2.

I am trying to implement it using pandas and I have written the following code but I get an error:

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Here's my code:

df = pd.read_excel('abc.xlsx')

if( df['Column2'] - df['Column1'] < 0.0 ):
    df['Column3'] = df['Column2']

elif (df['Column2'] - df['Column1'] > 0.0) :
    df['Column3'] = df['Column1']

else:
    df['Column3'] = df['Column1']

print(df)
2
Your code checks for each row to see which conditions are met, however, since it is working on a series, it cant tell what is True/False, ... hence the ambiguity. Using, a vectorized approach however, Pandas can correctly interprete the True/False results and assign the right values accordingly.sammywemmy
thanks very much for the explanation. May I ask why is it necessary for pandas to be able to tell what is True/False...it is just supposed to check the difference of two numbers and check whether it is greater/less/equal to 0 followed by operation. I am sorry if this is a dumb question, I am a noob.silentwraith
when working within python, the if function is executed on a line by line basis; in Pandas/Numpy, vectorization implies working on chunks/the entire column in one go. as such, checking row by row(which you can certainly do with iterrows) wont be efficient. I'm hoping my explanation did not make it worse. Maybe this link can helpsammywemmy
wow, so your answer is focussed on efficiency too. I think I understand the usage a lot more now. Thanks again. Cheers.silentwraith

2 Answers

1
votes

You can compare the two columns, then use numpy where to run a vectorised if/else :

 df["Column3"] = np.where(df.Column1.gt(df.Column2), df.Column2, df.Column1)
0
votes

Alternatively, if you want to do it with loops and if-else conditions,you can try this:

import pandas as pd
df=pd.read_excel('abc.xlsx')
empty_list=[]
for i,j in zip(df['Column1'],df['Column2']):
    difference=j-i
    if difference<0:          #if difference is negative
        empty_list.append(j)  #take value of Column2.
    elif difference>0:                   #if difference is positive
        empty_list.append(difference)    #take value of the difference.
    else:                     #if value is zero
        empty_list.append(i)  #you can take value of any column,here value of Column1 is taken.
        
Column3=pd.DataFrame(empty_list,columns=['Column3'])
new_df=pd.concat([df,Column3],axis=1)
print(new_df)