0
votes

Basic question here:

I'm trying to implement a simple classification model for credit card default where I just call model.fit and model.predict on my input data. However, that input data contains both categorical data (demographic information such as age, marital status, and education level) and continuous data (such as credit balances).

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
LIMIT_BAL    30000 non-null float64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
PAY_1        30000 non-null int64
PAY_2        30000 non-null int64
PAY_3        30000 non-null int64
PAY_4        30000 non-null int64
PAY_5        30000 non-null int64
PAY_6        30000 non-null int64
BILL_AMT1    30000 non-null float64
BILL_AMT2    30000 non-null float64
BILL_AMT3    30000 non-null float64
BILL_AMT4    30000 non-null float64
BILL_AMT5    30000 non-null float64
BILL_AMT6    30000 non-null float64
PAY_AMT1     30000 non-null float64
PAY_AMT2     30000 non-null float64
PAY_AMT3     30000 non-null float64
PAY_AMT4     30000 non-null float64
PAY_AMT5     30000 non-null float64
PAY_AMT6     30000 non-null float64
default      30000 non-null int64
dtypes: float64(13), int64(11)
memory usage: 5.7 MB

From my understanding, scikit-learn requires all data to be numerical and continuous or specifically coded as a categorical variable. The numerical part is not a problem since all of my data is coded numerically (like 0 for Married, 1 for not) but 3 of my variables (SEX, EDUCATION, and MARRIAGE) are nominal/ordinal and need to be encoded as categorical variables instead of int64 ones.

How do I encode these 3 variables with scikit-learn's preprocessing module so that I can properly feed these features into a model like Logistic Regression?

Thanks in advance, and please forgive the formatting (feel free to edit or recommend how I can properly include Jupyter Notebook output into a Stack Overflow post).

1
Where do you see the requirement that data be continuous? Are you talking about the difference between a classification vs a regression problem? Once it is numerically encoded, it is fit to be fed into the model. Have you tried, and what is the problem or error you encounter? - G. Anderson
Hmm, I guess the issue is with the encoding then. Right now, all my features are encoded as float64 or int64, but I think I need to encode all the int64 ones as categories with sklearn.preprocessing. I got the requirement that data had to be continuous from Datacamp's course on Supervised Learning. But I also found in scikit-learn's documentation that integer representations of categorical variables don't work. - Gideon Developer
@G.Anderson, thanks for the feedback. I will edit the question for clarity. - Gideon Developer
That makes more sense! Since the categorical variables are already numeric, just use the built-in One-Hot encoder to transform the variables into one-hot columns - G. Anderson

1 Answer

2
votes

Categorical features need extra attention during feature engineering, because features such as age or dates are difficult to encode. There are many ways to encode them, and the right choice depends on analyzing the data and on domain knowledge.

The category_encoders library offers many encoders for such features, including ones based on statistics of the data. You can find more here: http://contrib.scikit-learn.org/categorical-encoding/

Here is another good resource that shows how to use an encoding method through an example.
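
For the setup in the question, one possible way to wire the encoding into model.fit and model.predict is a ColumnTransformer inside a Pipeline. This is only a sketch under assumptions: the data below is randomly generated as a stand-in for the real 30000-row frame (with only a few of its columns), and scaling the numeric columns with StandardScaler is my own addition, not something the question asks for.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data; the values are random, not real records.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "LIMIT_BAL": rng.uniform(10_000, 500_000, n),
    "SEX": rng.integers(1, 3, n),
    "EDUCATION": rng.integers(1, 5, n),
    "MARRIAGE": rng.integers(1, 4, n),
    "AGE": rng.integers(21, 70, n),
    "BILL_AMT1": rng.uniform(0, 100_000, n),
    "default": rng.integers(0, 2, n),
})

categorical_cols = ["SEX", "EDUCATION", "MARRIAGE"]
numeric_cols = [c for c in data.columns if c not in categorical_cols + ["default"]]

# One-hot encode the categorical columns; scale the rest (scaling is an assumption).
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = data.drop(columns="default")
y = data["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The encoder is fitted on the training split only, then reused for prediction.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Keeping the encoder inside the pipeline means the same transformation is applied at fit and predict time, which avoids leaking information from the test split into the preprocessing step.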