1
votes

I am using Python 2.7 (miniconda interpreter). I am confused by the OneHotEncoder example below: why does enc.n_values_ output [2, 3, 4]? If anyone could help clarify, that would be great.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'float'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

regards, Lin


2 Answers

2
votes

n_values is the number of values per feature.

In this example,

X = 0 0 3
    1 1 0
    0 2 1
    1 0 2

(X's shape is [n_samples, n_features].)

For the first feature, there are 2 values: 0, 1.

For the second feature, there are 3 values: 0, 1, 2.

For the third feature, there are 4 values: 0, 1, 2, 3.

Therefore, enc.n_values_ is [2, 3, 4].
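To see how the other outputs in the question follow from these numbers, here is a minimal pure-Python sketch. It does not call scikit-learn and is not the library's implementation; the names `n_values`, `feature_indices` and `one_hot` are illustrative only. It assumes non-negative integer codes and takes the largest value in each column plus one, which for this data equals the number of distinct values per column.

```python
from itertools import accumulate

# Training data from the question: 4 samples, 3 features.
X = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

# Largest value in each column plus one (assumes non-negative integers).
# For this data it matches the count of distinct values per column.
n_values = [max(col) + 1 for col in zip(*X)]

# Offset at which each feature's block starts in the encoded row;
# the last entry is the total encoded width.
feature_indices = [0] + list(accumulate(n_values))

def one_hot(row):
    """Encode one sample by setting one position per feature (illustrative helper)."""
    encoded = [0.0] * feature_indices[-1]
    for value, offset in zip(row, feature_indices):
        encoded[offset + value] = 1.0
    return encoded

print(n_values)            # [2, 3, 4]
print(feature_indices)     # [0, 2, 5, 9]
print(one_hot([0, 1, 1]))  # [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

The offsets [0, 2, 5, 9] are the running totals of [2, 3, 4], so the sample [0, 1, 1] sets positions 0 + 0, 2 + 1 and 5 + 1, which matches the transform output in the question.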

0
votes

I think the explanation above of n_values_ is not complete or clear. Here is mine.
Consider this description: "n_values[i] is the number of categorical values in X[:, i]. Each feature value should be in range(n_values[i])". This means that n_values_[i] gives the range of values for column i: range(n_values_[i]), that is, [0, n_values_[i]).

For example:

from sklearn.preprocessing import OneHotEncoder
o = OneHotEncoder()
o.fit([[1, 2, 3], [1, 2, 3], [1, 2, 11], [2, 3, 1]])
o.n_values_  # array([ 3,  4, 12])

The result is [3, 4, 12]. Reading the data column by column makes this clearer: each value in the array gives the range for one column.

1 2 3
1 2 3
1 2 11
2 3 1

So, column by column, the values lie in range(3), range(4), and range(12), because each entry is the largest value in that column plus one. To make this clearer, here is one more example:

from sklearn.preprocessing import OneHotEncoder
o = OneHotEncoder()
o.fit([[1, 2, 3, 100], [1, 2, 3, 200], [1, 2, 11, 300], [2, 3, 1, 400]])
o.n_values_  # array([  3,   4,  12, 401])

This shows that the encoder examines the categorical data column by column, and that each feature's values lie in range(n_values_[i]) for its column i.
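The difference between "largest value plus one" and "number of distinct values" can be checked without scikit-learn. The sketch below is pure Python and only imitates the behaviour reported in this thread; the helper names `max_plus_one` and `distinct_counts` are made up for illustration, and it assumes non-negative integer features.

```python
def max_plus_one(X):
    """Largest value in each column plus one (the rule the outputs above follow)."""
    return [max(col) + 1 for col in zip(*X)]

def distinct_counts(X):
    """Number of distinct values in each column, for comparison."""
    return [len(set(col)) for col in zip(*X)]

# The two data sets used in this answer.
X1 = [[1, 2, 3], [1, 2, 3], [1, 2, 11], [2, 3, 1]]
X2 = [[1, 2, 3, 100], [1, 2, 3, 200], [1, 2, 11, 300], [2, 3, 1, 400]]

print(max_plus_one(X1))     # [3, 4, 12]
print(distinct_counts(X1))  # [2, 2, 3]
print(max_plus_one(X2))     # [3, 4, 12, 401]
print(distinct_counts(X2))  # [2, 2, 3, 4]
```

Only the "largest value plus one" rule reproduces [3, 4, 12] and [3, 4, 12, 401]. The two rules agree only when every value from 0 up to the column maximum appears in the data, as in the documentation example from the question.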