0
votes

For a research paper, I will be using a lasso model to perform classification and feature selection. I am preparing to use one-hot encoding to process my categorical data, and I need to figure out which encoded feature maps to which original categorical value in order to determine which features were ultimately selected for the final model. I've been googling this question for a while but have not found an answer.

How does scikit-learn's one-hot encoding assign values? For example, say my categorical values for a certain variable are {1, 2, 3, 4}. Does one-hot encoding organize them into dummies in ascending numerical order (i.e. it drops 1, makes the first dummy for value 2, the second dummy for value 3, and the third dummy for value 4)? Or does it assign them based on the order in which it encounters the different categorical values as it scans down the rows (e.g. the first observation has value 3 and the second observation has value 2, so 3 is dropped and the first dummy becomes value 2)?

Thanks!


1 Answer

1
votes

From a quick look at the source, it appears to me that the dummy columns do end up ordered by integer value. However, this is not documented, so you cannot count on it: it's not part of the contract. If you need to know which value ends up where, I suggest writing your own one-hot implementation. It shouldn't be too hard, and then you can rely on the ordering when you upgrade to new versions, etc.
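
To make the "write your own" suggestion concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the function names `fit_categories` and `one_hot` and the `drop_first` flag are hypothetical, chosen for illustration. It assumes you want columns in ascending order of the category values, with the smallest value optionally dropped as the baseline, as described in the question.

```python
def fit_categories(values):
    """Return the distinct category values in ascending order."""
    return sorted(set(values))


def one_hot(values, categories, drop_first=False):
    """Encode values as 0/1 rows; column j corresponds to kept[j]."""
    # Optionally drop the first (smallest) category as the baseline.
    kept = categories[1:] if drop_first else categories
    index = {cat: j for j, cat in enumerate(kept)}
    rows = []
    for v in values:
        if v not in categories:
            raise ValueError(f"unknown category: {v!r}")
        row = [0] * len(kept)
        if v in index:  # a dropped baseline value encodes as all zeros
            row[index[v]] = 1
        rows.append(row)
    return kept, rows


# Observations arrive out of order, as in the question's second scenario.
observations = [3, 2, 4, 1, 3]
categories = fit_categories(observations)
columns, encoded = one_hot(observations, categories, drop_first=True)

print(columns)  # the category value each dummy column stands for
for row in encoded:
    print(row)
```

Because `columns` is returned alongside the encoded rows, you always have an explicit mapping from column position back to the original value, regardless of the order in which the values appear in the data. It may also be worth checking whether the scikit-learn version you are using exposes a `categories_` attribute on a fitted `OneHotEncoder`; newer releases do, which could make a custom encoder unnecessary.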