handling too many categorical features using scikit-learn

Question

I am quite new to scikit-learn and I am trying to use it to make predictions on the income data. This may be a duplicate question, as I saw another post on this, but I am looking for an easy example to understand what scikit-learn estimators expect.

The data has the following structure, where many features are categorical (e.g. workclass, education):

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Example records:

38   Private    215646   HS-grad    9    Divorced    Handlers-cleaners   Not-in-family   White   Male   0   0   40   United-States   <=50K
53   Private    234721   11th   7    Married-civ-spouse  Handlers-cleaners   Husband     Black   Male   0   0   40   United-States   <=50K
30   State-gov  141297   Bachelors  13   Married-civ-spouse  Prof-specialty  Husband     Asian-Pac-Islander  Male   0   0   40   India   >50K

I am having a hard time handling the categorical features, as most of the models in scikit-learn expect all features to be numbers. It does provide some classes to transform/encode such features (like OneHotEncoder and DictVectorizer), but I cannot find a way to use these on my data. I know there are quite a few steps involved before the features are fully encoded as numbers, but I am wondering if anybody knows a simple and efficient way (since there are so many such features) that can be explained with an example. I vaguely know DictVectorizer is the way to go, but I need help with how to proceed.

Fred Foo · Accepted Answer · 2013-10-08T09:10:31

Here's some example code using DictVectorizer. First, let's set up some data in the Python shell. I leave reading from a file up to you.

>>> features = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
...             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country"]
>>> input_text = """38   Private    215646   HS-grad    9    Divorced    Handlers-cleaners   Not-in-family   White   Male   0   0   40   United-States   <=50K
... 53   Private    234721   11th   7    Married-civ-spouse  Handlers-cleaners   Husband     Black   Male   0   0   40   United-States   <=50K
... 30   State-gov  141297   Bachelors  13   Married-civ-spouse  Prof-specialty  Husband     Asian-Pac-Islander  Male   0   0   40   India   >50K
... """

Now, parse these:

>>> samples = []
>>> y = []
>>> for ln in input_text.splitlines():
...     values = ln.split()
...     y.append(values[-1])                  # last column is the label
...     d = dict(zip(features, values[:-1]))  # feature name -> raw string value
...     samples.append(d)
...

What have we got now? Let's check:

>>> from pprint import pprint
>>> pprint(samples[0])
{'age': '38',
 'capital-gain': '0',
 'capital-loss': '0',
 'education': 'HS-grad',
 'education-num': '9',
 'fnlwgt': '215646',
 'hours-per-week': '40',
 'marital-status': 'Divorced',
 'native-country': 'United-States',
 'occupation': 'Handlers-cleaners',
 'race': 'White',
 'relationship': 'Not-in-family',
 'sex': 'Male',
 'workclass': 'Private'}
>>> print(y)
['<=50K', '<=50K', '>50K']

These samples are ready for DictVectorizer, so pass them:

>>> from sklearn.feature_extraction import DictVectorizer
>>> dv = DictVectorizer()
>>> X = dv.fit_transform(samples)
>>> X
<3x29 sparse matrix of type '<type 'numpy.float64'>'
        with 42 stored elements in Compressed Sparse Row format>
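
The 29 columns follow from how DictVectorizer treats values: strings are one-hot encoded into name=value columns, while numbers are kept as a single numeric column. Every value parsed above is still a string, so continuous fields such as age get one-hot encoded too. A minimal sketch of checking this through the feature_names_ attribute, and of converting the continuous fields to float first if you want them kept as numbers. NUMERIC and to_numeric are names made up for this illustration, not part of scikit-learn:

```python
from sklearn.feature_extraction import DictVectorizer

features = ["age", "workclass", "fnlwgt", "education", "education-num",
            "marital-status", "occupation", "relationship", "race", "sex",
            "capital-gain", "capital-loss", "hours-per-week", "native-country"]
rows = [
    "38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K",
    "53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K",
    "30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K",
]
# Same parsing as in the walkthrough: drop the label, keep raw string values.
samples = [dict(zip(features, row.split()[:-1])) for row in rows]

# All values are strings, so every field turns into "name=value" indicator columns.
dv = DictVectorizer()
X = dv.fit_transform(samples)
print(X.shape)                 # (3, 29)
print(dv.feature_names_[:3])   # ['age=30', 'age=38', 'age=53']

# Hypothetical helper: fields the question marks as "continuous" become floats.
NUMERIC = {"age", "fnlwgt", "education-num",
           "capital-gain", "capital-loss", "hours-per-week"}

def to_numeric(sample):
    return {k: float(v) if k in NUMERIC else v for k, v in sample.items()}

# Numeric values are passed through as one column per field.
dv_num = DictVectorizer()
X_num = dv_num.fit_transform([to_numeric(s) for s in samples])
print(X_num.shape)             # (3, 23)
```

With the conversion, the six continuous fields take one column each and only the eight categorical fields are expanded, which is why the column count drops from 29 to 23 on these three records.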

Et voilà, you have X and y that can be passed to an estimator, provided it supports sparse matrices. (Otherwise, pass sparse=False to the DictVectorizer constructor.)
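
A sketch of that last step, assuming LogisticRegression as the estimator (the answer does not name one; it is picked here only because it accepts sparse input). To stay short, the samples use just two fields taken from the records in the question:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Cut-down stand-in for the parsed samples and labels (two fields only).
samples = [
    {"workclass": "Private", "education": "HS-grad"},
    {"workclass": "Private", "education": "11th"},
    {"workclass": "State-gov", "education": "Bachelors"},
]
y = ["<=50K", "<=50K", ">50K"]

# Default: a SciPy sparse matrix, which LogisticRegression accepts directly.
dv = DictVectorizer()
X = dv.fit_transform(samples)
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))

# For estimators that need dense input, ask for a NumPy array instead.
dense_dv = DictVectorizer(sparse=False)
X_dense = dense_dv.fit_transform(samples)
print(type(X_dense), X_dense.shape)
```

Three training rows are far too few for a meaningful model; the point is only that X and y go straight into fit.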

Test samples can similarly be passed to DictVectorizer.transform; if there are feature/value combinations in the test set that do not occur in the training set, these will simply be ignored (because the learned model cannot do anything sensible with them anyway).
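
A small sketch of that behaviour, using two made-up training dicts and a test dict whose native-country value was never seen during fit:

```python
from sklearn.feature_extraction import DictVectorizer

train = [
    {"workclass": "Private", "native-country": "United-States"},
    {"workclass": "State-gov", "native-country": "India"},
]
dv = DictVectorizer(sparse=False)
dv.fit(train)
print(dv.feature_names_)
# ['native-country=India', 'native-country=United-States',
#  'workclass=Private', 'workclass=State-gov']

# "Canada" never occurred during fit, so it gets no column;
# only the workclass=Private indicator is set for this row.
test = [{"workclass": "Private", "native-country": "Canada"}]
X_test = dv.transform(test)
print(X_test)
# [[0. 0. 1. 0.]]
```

Note that the test matrix keeps the same four columns as the training matrix, which is what lets the fitted estimator consume it.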
