OneHotEncoder encodes categorical features, i.e. features whose values come from a set of categories. For example, the feature "vehicle" can take a value from the set {"car", "motorcycle", "truck", ...}. You use this feature type when there is no order between the values: a car is not comparable with a motorcycle or a truck. Even though you encode the set {"car", "motorcycle", "truck"} with integers, you want to train an estimator that doesn't assume any relationship between the values of the categorical feature. To transform such a feature into binary or real-valued form while keeping the values unordered, you can use One Hot Encoding. It's a very common technique: each categorical feature in the original dataset is replaced by n new binary features, where n is the number of unique values of that categorical feature. If you want to know exactly where those n new binary features are located in the resulting dataset, use the feature_indices_ attribute: all new binary features for categorical feature i from the original dataset are in columns feature_indices_[i]:feature_indices_[i+1] of the new dataset (before any unused columns are masked out via active_features_).
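To make the idea concrete, here is a minimal plain-Python sketch of one-hot encoding for the "vehicle" feature. The helper name one_hot and the choice to sort the categories into a fixed column order are assumptions made for illustration; this is not how scikit-learn implements it internally.

```python
def one_hot(values):
    """Return (categories, rows) with one binary column per unique value."""
    # Fix a column order for the categories (sorting is an illustrative choice).
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one column is set for each sample
        rows.append(row)
    return categories, rows


vehicles = ["car", "motorcycle", "truck", "car"]
categories, encoded = one_hot(vehicles)
print(categories)
for row in encoded:
    print(row)
```

Each sample ends up with a single 1 in the column of its category, so no ordering between "car", "motorcycle" and "truck" is implied.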
OneHotEncoder determines the range of each categorical feature from the values that feature takes in the dataset. Look at this example:
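from sklearn.preprocessing import OneHotEncoder

dataset = [[0, 0],
           [1, 1],
           [2, 4],
           [0, 5]]
enc = OneHotEncoder()
enc.fit(dataset)
# Number of values per feature, inferred from the data as (max value + 1)
print(enc.n_values_)
# Column offsets: feature i maps to feature_indices_[i]:feature_indices_[i+1]
print(enc.feature_indices_)
# Indices of the columns that actually occur in the fitted data
print(enc.active_features_)
enc.transform(dataset).toarray()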
array([[ 1., 0., 0., 1., 0., 0., 0.],
       [ 0., 1., 0., 0., 1., 0., 0.],
       [ 0., 0., 1., 0., 0., 1., 0.],
       [ 1., 0., 0., 0., 0., 0., 1.]])
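The result has 7 columns rather than 3 + 6 = 9 because the values 2 and 3 never occur in the second feature, so their columns are dropped. The following plain-Python sketch reproduces that bookkeeping as I understand it; the function names legacy_layout and legacy_transform are hypothetical, and the code only mimics the behaviour described above rather than calling scikit-learn. Note also that n_values_, feature_indices_ and active_features_ belong to older scikit-learn releases (they were deprecated in 0.20 and removed in 0.22, where categories_ is used instead).

```python
def legacy_layout(dataset):
    """Compute per-feature ranges, column offsets and the columns in use."""
    n_features = len(dataset[0])
    # Range of each feature is inferred as (largest value seen) + 1.
    n_values = [max(row[j] for row in dataset) + 1 for j in range(n_features)]
    # Cumulative offsets: feature j owns columns
    # feature_indices[j]:feature_indices[j + 1] before masking.
    feature_indices = [0]
    for n in n_values:
        feature_indices.append(feature_indices[-1] + n)
    # Keep only the columns hit by at least one sample.
    active = sorted({feature_indices[j] + row[j]
                     for row in dataset for j in range(n_features)})
    return n_values, feature_indices, active


def legacy_transform(dataset, feature_indices, active):
    """One-hot encode the rows, using only the active columns."""
    column_of = {full: k for k, full in enumerate(active)}
    out = []
    for row in dataset:
        encoded = [0.0] * len(active)
        for j, value in enumerate(row):
            encoded[column_of[feature_indices[j] + value]] = 1.0
        out.append(encoded)
    return out


dataset = [[0, 0], [1, 1], [2, 4], [0, 5]]
n_values, feature_indices, active = legacy_layout(dataset)
print(n_values)         # [3, 6]
print(feature_indices)  # [0, 3, 9]
print(active)           # [0, 1, 2, 3, 4, 7, 8]
result = legacy_transform(dataset, feature_indices, active)
for row in result:
    print(row)
```

The rows printed at the end match the array shown above: the first three columns belong to the first feature, and the remaining four are the values 0, 1, 4 and 5 of the second feature.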