0
votes

In data mining we often use one-hot encoding for categorical features, so a single categorical feature is encoded as several 0/1 features.

There is a special case that confuses me. My dataset has one categorical feature and one numerical feature. I encoded the categorical feature into 300 new 0/1 features and normalized the numerical feature with MinMaxScaler, so all feature values lie in the range 0 to 1. What looks suspicious is that the ratio of categorical to numerical features seems to have changed from 1:1 to 300:1.

Is my encoding method correct? This makes me doubt one-hot encoding; I think it may lead to unbalanced features.

Can anybody clarify this? Any input would be appreciated. Thanks!


1 Answer

1
votes

As each record has only one category, only one of the 300 one-hot features will be 1 for any given record.

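Effectively, with such preprocessing, the weight on the categorical feature will only be about 2 times the weight of a standardized feature, not 300 times. (2 times, if you consider squared Euclidean distances between objects of two different categories: they differ in exactly two of the 0/1 columns, so the categorical part contributes 2, while a feature scaled to the range 0 to 1 contributes at most 1.)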
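To make the "2 times" figure concrete, here is a minimal sketch in plain Python. It assumes squared Euclidean distance and a numerical feature scaled to [0, 1]; the helper names `one_hot` and `squared_distance` and the category indices 3 and 42 are made up for illustration.

```python
def one_hot(index, n_categories):
    """Return a 0/1 list with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(n_categories)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


N_CATEGORIES = 300  # number of one-hot columns, as in the question

a = one_hot(3, N_CATEGORIES)   # record in (hypothetical) category 3
b = one_hot(42, N_CATEGORIES)  # record in (hypothetical) category 42
c = one_hot(3, N_CATEGORIES)   # another record in category 3

# Two different categories differ in exactly two columns.
categorical_diff = squared_distance(a, b)  # 2.0
# Same category: no contribution at all.
categorical_same = squared_distance(a, c)  # 0.0
# A min-max scaled numerical feature contributes at most (1 - 0)^2.
numeric_max = squared_distance([0.0], [1.0])  # 1.0

print(categorical_diff, categorical_same, numeric_max)
```

The categorical contribution stays at 2 no matter how many categories there are, so under this distance the 300 columns do not act like 300 independent features.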

But in essence you are right: one-hot encoding is not particularly smart. It is an ugly hack to make programs run on data they do not support. Things get worse with algorithms such as k-means, which assume that we can take the mean of these variables and that minimizing squared errors on them is meaningful. The statistical value of the results will be limited.
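A small sketch of the k-means concern, again in plain Python with made-up helper names and a made-up three-category cluster: the mean of several one-hot vectors is a vector of category frequencies, which no longer corresponds to any single category.

```python
def one_hot(index, n_categories):
    """Return a 0/1 list with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(n_categories)]


def centroid(points):
    """Column-wise mean, as k-means computes for a cluster center."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


# Hypothetical cluster: two records of category 0, one each of 1 and 2.
cluster = [one_hot(0, 3), one_hot(0, 3), one_hot(1, 3), one_hot(2, 3)]

center = centroid(cluster)
print(center)  # [0.5, 0.25, 0.25]
```

The center is "half category 0, a quarter category 1, a quarter category 2", which is not a value any record can take. That is one way to see why squared-error results on such columns are hard to interpret.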