432
votes

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.

I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can't be counted as anything else. Even if they have a "1" in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can't be counted as a unique ethnicity(except for Hispanic). Hopefully this makes sense. Any help will be greatly appreciated.

Its almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.

From the dataframe below I need to calculate a new column based on the following spec in SQL:

========================= CRITERIA ===============================

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”

Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”

Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”

====================== DATAFRAME ===========================

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White
6
Your particular function is just a long if-else ladder where some variables' values take priority over others. It would be called a priority-decoder in hardware engineering parlance.smci

6 Answers

559
votes

OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".

Next, use the apply function in pandas to apply the function - e.g.

df.apply (lambda row: label_race(row), axis=1)

Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

The resultant dataframe looks like this (scroll to the right to see the new column):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White
312
votes

Since this is the first Google result for 'pandas new column from others', here's a simple example:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

If you get the SettingWithCopyWarning you can do it this way also:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values})

And here's the documentation for apply, and assign.

90
votes

The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


First, define conditions:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

Now, define the corresponding outputs:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

Finally, using numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.

36
votes

.apply() takes in a function as the first parameter; pass in the label_race function as so:

df['race_label'] = df.apply(label_race, axis=1)

You don't need to make a lambda function to pass in a function.

23
votes

try this,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O/P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

use .loc instead of apply.

it improves vectorization.

.loc works in simple manner, mask rows based on the condition, apply values to the freeze rows.

for more details visit, .loc docs

Performance metrics:

Accepted Answer:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Proposed Answer:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

0
votes

As @user3483203 pointed out, numpy.select is the best approach

Store your conditional statements and the corresponding actions in two lists

conds = [(df['eri_hispanic'] == 1),(df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1)),(df['eri_nat_amer'] == 1),(df['eri_asian'] == 1),(df['eri_afr_amer'] == 1),(df['eri_hawaiian'] == 1),(df['eri_white'] == 1,])

actions = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']

You can now use np.select using these lists as its arguments

df['label_race'] = np.select(conds,actions,default='Other')

Reference: https://numpy.org/doc/stable/reference/generated/numpy.select.html