I am trying to build a lasso regression prediction model. I encoded all my categorical integer features with a one-hot (one-of-K) scheme using OneHotEncoder in scikit-learn. According to the fitted model, only 51 parameters actually influence the prediction. I want to investigate these parameters, but they refer to the encoded columns described above. How can I determine which categorical integer feature corresponds to each one-hot encoded column? Thanks!
3 Answers
0
votes
Using the active_features_, feature_indices_, and n_values_ attributes of sklearn.preprocessing.OneHotEncoder (available in older scikit-learn releases; they were deprecated in 0.20 and removed in 0.22), a vector of the category values, ordered by their position in the one-hot array, can be built as follows:
import numpy as np
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.active_features_ - np.repeat(enc.feature_indices_[:-1], enc.n_values_)
# array([0, 1, 0, 1, 2, 0, 1, 2, 3], dtype=int64)
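To answer the question as asked (which original column each one-hot column came from), the same offsets can also give the feature index. The following is a minimal sketch, not part of the encoder's API: it assumes the fitted values of feature_indices_ and active_features_ from the example above are copied in as plain lists so that it runs without a fitted encoder, and the helper name column_origin is hypothetical.

```python
from bisect import bisect_right

# Values copied from the fitted encoder in the example above (assumed,
# hard-coded here so the sketch runs on its own).
feature_indices = [0, 2, 5, 9]                   # enc.feature_indices_
active_features = [0, 1, 2, 3, 4, 5, 6, 7, 8]    # enc.active_features_


def column_origin(feature_indices, active_features):
    """Return (original feature index, category value) per one-hot column."""
    origins = []
    for a in active_features:
        # Index of the last block whose start offset is <= a.
        feat = bisect_right(feature_indices, a) - 1
        origins.append((feat, a - feature_indices[feat]))
    return origins


origins = column_origin(feature_indices, active_features)
# With NumPy arrays, the same block lookup can be written as
# np.searchsorted(feature_indices, active_features, side="right") - 1.
```

Because each column is looked up individually, this sketch should also work when some category values never occurred during fitting, a case in which the np.repeat expression above would fail on mismatched lengths.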
Also, the original data can be recovered from the one-hot array as follows:
x = enc.transform([[0, 1, 1], [1, 2, 3]]).toarray()
# array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.],
# [ 0., 1., 0., 0., 1., 0., 0., 0., 1.]])
cond = x > 0
[enc.active_features_[c.ravel()] - enc.feature_indices_[:-1] for c in cond]
# [array([0, 1, 1], dtype=int64), array([1, 2, 3], dtype=int64)]
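Tying this back to the lasso model in the question: once each one-hot column is mapped to an (original feature, category value) pair, the columns with non-zero coefficients can be listed directly. The following is a sketch under stated assumptions: coef stands in for a fitted model's coefficient vector with one entry per one-hot column, its values are made up purely for illustration, and the helper name selected_columns is hypothetical.

```python
# Per-column origin for the 9 one-hot columns of the example above:
# (original feature index, category value).
origins = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2),
           (2, 0), (2, 1), (2, 2), (2, 3)]

# Made-up coefficients, one per one-hot column (a stand-in for the
# coefficient vector of a fitted lasso model).
coef = [0.0, 1.5, 0.0, 0.0, -0.7, 0.0, 0.0, 0.0, 2.1]


def selected_columns(origins, coef, tol=0.0):
    """List (feature, value, coefficient) for columns with |coef| > tol."""
    return [(feat, val, c)
            for (feat, val), c in zip(origins, coef)
            if abs(c) > tol]


for feat, val, c in selected_columns(origins, coef):
    print(f"feature {feat} == {val}: coefficient {c}")
```

A small positive tol can be passed if coefficients that are numerically close to zero should be treated as dropped.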
-1
votes
I wrote ple to extend sklearn's Pipeline and FeatureUnion so that categorical features can be traced back after one-hot encoding or other preprocessing steps. It can also draw the transformation using GraphX.

You can find ple on my GitHub page.
feature_indices_ attribute. - hellpanderr