active_features_ attribute in OneHotEncoder

Question

I am new to machine learning and I am trying to understand what OneHotEncoder does. I can distinguish it from other things such as LabelEncoder. In particular, I find the documentation on active_features_ confusing.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

It is also mentioned in the documentation of feature_indices_:

feature_indices_ :
array of shape (n_features,)
Indices to feature ranges. Feature i in the original data is mapped to features from feature_indices_[i] to feature_indices_[i+1] (and then potentially masked by active_features_ afterwards)

What does this mean? What is the mask for?

Thank you!

Ibraim Ganiev · Accepted Answer · 2015-11-08T17:41:02

OneHotEncoder encodes categorical features, i.e. features whose values are categories. For example, a feature "vehicle" can take a value from the set {"car", "motorcycle", "truck", ...}. This feature type is used when there is no order between the values: a car is not comparable with a motorcycle or a truck. Even though you encode the set {"car", "motorcycle", "truck"} with integers, you want to learn an estimator that doesn't assume any relationship between the values of the categorical feature.

To transform this feature type into binary or rational features while keeping the values unordered, you can use one-hot encoding. It's a very common technique: each categorical feature in the original dataset is replaced by n new binary features, where n is the number of unique values of that categorical feature. If you want to know exactly where those n new binary features are located in the resulting dataset, use the feature_indices_ attribute: all new binary features for categorical feature i of the original dataset are in columns feature_indices_[i]:feature_indices_[i+1] of the new dataset (before the active_features_ mask is applied).
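As a minimal sketch of the idea, here it is in plain Python without scikit-learn: each unique value of a categorical feature gets its own binary column. The helper name one_hot and the sample values are illustrative assumptions, not part of any library.

```python
def one_hot(values):
    """Return (categories, rows) with one binary column per unique value."""
    categories = sorted(set(values))
    # Each row has a single 1 in the column of its own category.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


vehicles = ["car", "truck", "motorcycle", "car"]
categories, rows = one_hot(vehicles)

print(categories)
for row in rows:
    print(row)
```

No column is "larger" than another, so an estimator trained on these columns cannot pick up an artificial ordering between car, motorcycle and truck.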

OneHotEncoder determines the range of each categorical feature from the values that feature takes in the dataset. Look at this example:

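# Note: n_values_, feature_indices_ and active_features_ belong to the legacy
# OneHotEncoder API and are not available in newer scikit-learn releases.
from sklearn.preprocessing import OneHotEncoder

dataset = [[0, 0],
           [1, 1],
           [2, 4],
           [0, 5]]

# The first categorical feature has values in the range [0, 2], and the dataset contains all values from that range.
# The second feature has values in the range [0, 5], but values 2 and 3 are missing.
# The encoder assumes the categories were encoded with the full integer range, so 2 and 3 either exist somewhere outside this dataset or are a sort of error.
# Either way, OneHotEncoder removes the columns for values 2 and 3 from the resulting dataset.
enc = OneHotEncoder()
enc.fit(dataset)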

print(enc.n_values_)
# prints [3 6]
# the first feature has 3 possible values, i.e. 3 columns in the resulting dataset
# the second feature has 6 possible values
print(enc.feature_indices_)
# prints [0 3 9]
# the first feature is decomposed into 3 columns (0, 1, 2), the second into 6 (3, 4, 5, 6, 7, 8)
print(enc.active_features_)
# prints [0 1 2 3 4 7 8]
# values 2 and 3 of the second feature never occurred, so active_features_ doesn't list their columns (5, 6), and the resulting dataset will not contain those columns either
enc.transform(dataset).toarray()
# returns this array (one row per sample, one column per active feature):
# array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.],
#        [ 0.,  1.,  0.,  0.,  1.,  0.,  0.],
#        [ 0.,  0.,  1.,  0.,  0.,  1.,  0.],
#        [ 1.,  0.,  0.,  0.,  0.,  0.,  1.]])
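To see where the numbers come from, the same bookkeeping can be reproduced by hand. This is a plain-Python sketch that assumes the integer-range behavior described in this answer; it is not scikit-learn's actual implementation, and the variable names only mirror the attribute names for readability. It avoids calling the library because newer scikit-learn releases expose categories_ instead of these attributes.

```python
dataset = [[0, 0],
           [1, 1],
           [2, 4],
           [0, 5]]

n_features = len(dataset[0])
columns = [[row[i] for row in dataset] for i in range(n_features)]

# Each feature is assumed to take integer values 0..max, so it reserves max + 1 columns.
n_values = [max(col) + 1 for col in columns]

# Cumulative offsets: feature i owns columns feature_indices[i]:feature_indices[i + 1].
feature_indices = [0]
for n in n_values:
    feature_indices.append(feature_indices[-1] + n)

# The mask: keep only the columns whose value was actually seen in the data.
active_features = sorted(
    feature_indices[i] + v for i, col in enumerate(columns) for v in set(col)
)

# Transform: locate the full-width column of each value, then map it to its
# position among the active columns.
position = {col: j for j, col in enumerate(active_features)}
encoded = []
for row in dataset:
    out = [0] * len(active_features)
    for i, v in enumerate(row):
        out[position[feature_indices[i] + v]] = 1
    encoded.append(out)

print(n_values)         # [3, 6]
print(feature_indices)  # [0, 3, 9]
print(active_features)  # [0, 1, 2, 3, 4, 7, 8]
for row in encoded:
    print(row)
```

So feature_indices_ describes the layout with all 9 reserved columns, and active_features_ is the list of the 7 columns that survive once the never-seen values are dropped.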
