OneHotEncoder encodes categorical features, i.e. features whose values are categories. For example, a feature "vehicle" can take a value from the set {"car", "motorcycle", "truck", ...}. This feature type is used when there is no order between the values: a car is not comparable with a motorcycle or a truck. Even though you encode the set {"car", "motorcycle", "truck"} with integers, you want to train an estimator that doesn't assume any relationship between the values of the categorical feature. To transform this feature type into binary or rational features and still keep the values unordered, you can use One Hot Encoding.

It's a very common technique: each categorical feature in the original dataset is replaced by n new binary features, where n is the number of unique values of the original categorical feature. If you want to know exactly where those n new binary features are located in the resulting dataset, use the feature_indices_ attribute: all new binary features for categorical feature i of the original dataset are in columns feature_indices_[i]:feature_indices_[i+1] of the new dataset (counted before columns for values that never occur are dropped).
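To make the idea concrete, here is a minimal plain-Python sketch of one-hot encoding for the "vehicle" feature. It is only an illustration of the technique, not how scikit-learn implements it; the helper name one_hot and the sample list are made up for this example.

```python
def one_hot(values):
    """Map each categorical value to a binary vector containing a single 1."""
    categories = sorted(set(values))                  # the n unique values
    index = {c: k for k, c in enumerate(categories)}  # value -> column number
    rows = [[1 if index[v] == k else 0 for k in range(len(categories))]
            for v in values]
    return categories, rows


vehicles = ["car", "truck", "motorcycle", "car"]
categories, encoded = one_hot(vehicles)
print(categories)  # ['car', 'motorcycle', 'truck']
print(encoded)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each value gets its own column, so no column ordering leaks into the model as a numeric relationship between "car" and "truck".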
OneHotEncoder determines the range of each categorical feature from the values that feature takes in the dataset. Look at this example:
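# Note: n_values_, feature_indices_ and active_features_ belong to the legacy
# OneHotEncoder API (deprecated in scikit-learn 0.20 and removed in later releases).
from sklearn.preprocessing import OneHotEncoder

dataset = [[0, 0],
           [1, 1],
           [2, 4],
           [0, 5]]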
# The first categorical feature has values in the range [0, 2], and the dataset contains every value from that range.
# The second feature has values in the range [0, 5], but the values 2 and 3 are missing.
# If the categorical values were encoded with that integer range, 2 and 3 should occur somewhere; otherwise it is some sort of error.
# Thus OneHotEncoder will remove the columns for values 2 and 3 from the resulting dataset.
enc = OneHotEncoder()
enc.fit(dataset)
print(enc.n_values_)
# prints [3 6]
# the first feature has 3 possible values, i.e. 3 columns in the resulting dataset
# the second feature has 6 possible values
print(enc.feature_indices_)
# prints [0 3 9]
# the first feature is decomposed into 3 columns (0, 1, 2), the second into 6 (3, 4, 5, 6, 7, 8)
print(enc.active_features_)
# prints [0 1 2 3 4 7 8]
# but two values of the second feature never occurred, so active_features_ doesn't list columns (5, 6), and the resulting dataset will not contain those columns either
enc.transform(dataset).toarray()
# returns this array:
# array([[ 1., 0., 0., 1., 0., 0., 0.],
#        [ 0., 1., 0., 0., 1., 0., 0.],
#        [ 0., 0., 1., 0., 0., 1., 0.],
#        [ 1., 0., 0., 0., 0., 0., 1.]])
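
The bookkeeping behind those three attributes can be reproduced without scikit-learn. The following is a rough plain-Python sketch under the assumption that every feature is a non-negative integer and that the range of a feature is 0..max; the function name legacy_layout is hypothetical, and this is not the library's actual code.

```python
def legacy_layout(dataset):
    """Emulate n_values_, feature_indices_, active_features_ and the output."""
    n_features = len(dataset[0])
    # Range of each feature: largest observed value + 1.
    n_values = [max(row[i] for row in dataset) + 1 for i in range(n_features)]
    # Cumulative offsets: feature i owns columns
    # feature_indices[i]:feature_indices[i + 1].
    feature_indices = [0]
    for n in n_values:
        feature_indices.append(feature_indices[-1] + n)
    # Value v of feature i lives in column feature_indices[i] + v;
    # only columns that were actually hit are kept.
    active = sorted({feature_indices[i] + row[i]
                     for row in dataset for i in range(n_features)})
    position = {col: k for k, col in enumerate(active)}
    encoded = []
    for row in dataset:
        out = [0.0] * len(active)
        for i, v in enumerate(row):
            out[position[feature_indices[i] + v]] = 1.0
        encoded.append(out)
    return n_values, feature_indices, active, encoded


dataset = [[0, 0], [1, 1], [2, 4], [0, 5]]
n_values, feature_indices, active, encoded = legacy_layout(dataset)
print(n_values)         # [3, 6]
print(feature_indices)  # [0, 3, 9]
print(active)           # [0, 1, 2, 3, 4, 7, 8]
for row in encoded:
    print(row)
```

The rows printed at the end match the array above: the columns for the unused values 2 and 3 of the second feature (columns 5 and 6 in the full layout) are gone.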