Backtracking categorical features from one-hot-encoding in scikit-learn?

Question

I am trying to construct a lasso regression prediction model. I encoded all my categorical integer features using a one-hot (aka one-of-K) scheme with OneHotEncoder in scikit-learn. Based on the result, only 51 parameters actually influence the prediction model. I want to investigate these parameters, but they are encoded as described above. How can I find out which categorical integer feature each column of the one-hot encoded array corresponds to? Thanks!

Possible duplicate of How to reverse sklearn.OneHotEncoder transform to recover original data? — Mack

BMW · Accepted Answer · 2015-12-29T23:02:09

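Using the active_features_, feature_indices_, and n_values_ attributes of sklearn.preprocessing.OneHotEncoder, you can build a vector that gives, for each column of the one-hot array, the category value that column encodes. Note that these attributes belong to the legacy encoder and were removed in scikit-learn 0.22, so the code below needs an older release: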

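import numpy as np
from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# Subtract each feature's starting offset from the active column indices
# to get the category value encoded by each one-hot column.
# This assumes every value from 0 to n_values_ - 1 occurs in the fitted data.
enc.active_features_ - np.repeat(enc.feature_indices_[:-1], enc.n_values_)
# array([0, 1, 0, 1, 2, 0, 1, 2, 3], dtype=int64)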
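
On scikit-learn 0.20 and later, a similar mapping can probably be built from the fitted encoder's categories_ attribute instead. The snippet below is a minimal sketch, not part of the original answer: the coef vector is made up to stand in for fitted lasso coefficients, and the names feature_of_column, value_of_column, and selected are chosen here purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
enc = OneHotEncoder()
enc.fit(X)

# categories_ holds one sorted array of observed values per input column.
sizes = [len(c) for c in enc.categories_]

# Index of the original feature behind each one-hot column.
feature_of_column = np.repeat(np.arange(len(enc.categories_)), sizes)
# Category value encoded by each one-hot column.
value_of_column = np.concatenate(enc.categories_)

# Hypothetical lasso coefficients, one per one-hot column (made up here).
coef = np.array([0.0, 1.5, 0.0, 0.0, -0.7, 0.0, 0.0, 0.0, 2.1])

# (feature index, category value) for every column with a nonzero coefficient.
selected = [(int(feature_of_column[j]), int(value_of_column[j]))
            for j in np.flatnonzero(coef)]
print(selected)  # [(0, 1), (1, 2), (2, 3)]
```

With a real model, coef would be replaced by the fitted estimator's coefficient vector, provided its columns are in the same order as the encoder's output.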

Also, the original data can be returned from the one-hot array as follows:

x = enc.transform([[0, 1, 1], [1, 2, 3]]).toarray()
# array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.],
#       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.]])

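# Boolean mask of the active one-hot columns in each row.
cond = x > 0
# For each row, map the active columns back to category values by
# subtracting each feature's starting offset.
[enc.active_features_[c.ravel()] - enc.feature_indices_[:-1] for c in cond]
# [array([0, 1, 1], dtype=int64), array([1, 2, 3], dtype=int64)]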
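
On scikit-learn 0.20 and later, the encoder also provides an inverse_transform method that should cover this second step directly. The following is a small sketch under that version assumption, reusing the same toy data; the variable name recovered is chosen here for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# Encode two rows, then convert the sparse result to a dense array.
x = enc.transform([[0, 1, 1], [1, 2, 3]]).toarray()

# Map the one-hot rows back to the original category values.
recovered = enc.inverse_transform(x)
print(recovered.tolist())  # [[0, 1, 1], [1, 2, 3]]
```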
