How to classify observations based on their covariates in dataframe and numpy?

Question

I have a dataset with n observations and say 2 variables X1 and X2. I am trying to classify each observation based on a set of conditions on their (X1, X2) values. For example, the dataset looks like

df:
Index     X1    X2
1         0.2   0.8
2         0.6   0.2
3         0.2   0.1
4         0.9   0.3

and the groups are defined by

Group 1: X1<0.5 & X2>=0.5
Group 2: X1>=0.5 & X2>=0.5
Group 3: X1<0.5 & X2<0.5
Group 4: X1>=0.5 & X2<0.5

I'd like to generate the following dataframe.

expected result:
Index     X1    X2    Group
1         0.2   0.8   1
2         0.6   0.2   4
3         0.2   0.1   3
4         0.9   0.3   4

Also, would it be better/faster to work with numpy arrays for this type of problem?

sacuL · Accepted Answer · 2018-03-02T02:45:04

In answer to your last question, I definitely think pandas is a good tool for this. It could be done in pure numpy, but pandas is arguably more intuitive when working with dataframes, and it is fast enough for most applications. pandas and numpy also work well together. For instance, in your case, you can use numpy.select to build your pandas column:

import numpy as np
import pandas as pd
# Lay out your conditions
conditions = [(df.X1 < 0.5) & (df.X2 >= 0.5),
              (df.X1 >= 0.5) & (df.X2 >= 0.5),
              (df.X1 < 0.5) & (df.X2 < 0.5),
              (df.X1 >= 0.5) & (df.X2 < 0.5)]

# Name the resulting groups (in the same order as the conditions)
choicelist = [1, 2, 3, 4]

df['group'] = np.select(conditions, choicelist, default=-1)

# Above, I've set the default to -1, but change it as you see fit.
# If none of your conditions are met, that row is classified as -1.

>>> df
   Index   X1   X2  group
0      1  0.2  0.8      1
1      2  0.6  0.2      4
2      3  0.2  0.1      3
3      4  0.9  0.3      4
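
For completeness, here is a minimal self-contained sketch of the same idea that also touches on the numpy part of the question. It rebuilds the sample data from the question and applies numpy.select to plain arrays pulled out with to_numpy(). The names x1, x2, choices and group are just illustrative, and I have not benchmarked this against the column-based version, so treat any speed difference as something to measure on your own data.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Index": [1, 2, 3, 4],
    "X1": [0.2, 0.6, 0.2, 0.9],
    "X2": [0.8, 0.2, 0.1, 0.3],
})

# Pull the two covariates out as plain NumPy arrays (illustrative names)
x1 = df["X1"].to_numpy()
x2 = df["X2"].to_numpy()

# One boolean mask per group, in the same order as the group labels
conditions = [
    (x1 < 0.5) & (x2 >= 0.5),   # Group 1
    (x1 >= 0.5) & (x2 >= 0.5),  # Group 2
    (x1 < 0.5) & (x2 < 0.5),    # Group 3
    (x1 >= 0.5) & (x2 < 0.5),   # Group 4
]
choices = [1, 2, 3, 4]

# Rows matching none of the masks get the default label -1
group = np.select(conditions, choices, default=-1)

# Attach the result back to the dataframe
df["group"] = group
print(df)
```

One thing to keep in mind: a row with a missing value (NaN) in X1 or X2 fails every comparison, so it ends up with the default label.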
