python - Compare two columns using pandas

Question

Using this as a starting point:

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

Out[8]: 
  one  two three
0   10  1.2   4.2
1   15  70   0.03
2    8   5     0

I want to use something like an if statement within pandas.

if df['one'] >= df['two'] and df['one'] <= df['three']:
    df['que'] = df['one']

Basically, check each row via the if statement, create new column.

The docs say to use .all but there is no example...

@Merlin: If you have numeric data in a column, it is best not to mix it with strings. Doing so changes the column's dtype to object. This allows arbitrary Python objects to be stored in the column, but it comes at the cost of slower numeric computation. Thus if the column is storing numeric data, using NaNs for not-a-numbers is preferable. — unutbu
Having integers as strings and trying to do comparison on them looks odd: a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]. This creates confusing results with "correct" code: df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])] yields 10 for the first line, while it should yield NaN if the input would have been integers. — Primer

unutbu unutbu · Accepted Answer · 2014-12-14T23:51:22

You could use np.where. If cond is a boolean array, and A and B are arrays, then

C = np.where(cond, A, B)

defines C to be equal to A where cond is True, and B where cond is False.

import numpy as np
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three'])
                     , df['one'], np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN

If you have more than one condition, then you could use np.select instead. For example, if you wish df['que'] to equal df['two'] when df['one'] < df['two'], then

conditions = [
    (df['one'] >= df['two']) & (df['one'] <= df['three']), 
    df['one'] < df['two']]

choices = [df['one'], df['two']]

df['que'] = np.select(conditions, choices, default=np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03   70
2   8    5     0  NaN

If we can assume that df['one'] >= df['two'] when df['one'] < df['two'] is False, then the conditions and choices could be simplified to

conditions = [
    df['one'] < df['two'],
    df['one'] <= df['three']]

choices = [df['two'], df['one']]

(The assumption may not be true if df['one'] or df['two'] contain NaNs.)

Note that

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

defines a DataFrame with string values. Since they look numeric, you might be better off converting those strings to floats:

df2 = df.astype(float)

This changes the results, however, since strings compare character-by-character, while floats are compared numerically.

In [61]: '10' <= '4.2'
Out[61]: True

In [62]: 10 <= 4.2
Out[62]: False

python - Compare two columns using pandas

8 Answers