I'm using linear regression in scikit-learn, and my dataset contains some features that are categorical but expressed as numbers. For example, the district where a house is located is encoded as an integer between 1 and 7: the higher the number, the more valuable the house. Should I preprocess a feature like this (a category expressed with numbers) with an encoder such as OneHotEncoder before fitting the linear regression? Or is encoding only required when the category is expressed as strings? Thank you in advance.
Keep in mind that one-hot encoding increases the dimensionality of your data, which can introduce other problems. One technique for handling categorical variables is to group the rows by category, compute the mean of the target variable for each group, sort the categories by that mean, and replace each category with its rank in the sorted order.
- hatef Alipoor
@HamedAlipour Thank you for your reply. The approach you suggested seems to be very interesting. Do you know where I could find an example of this to understand it better?
- kudu exa
You're welcome. I couldn't find the blog post describing this approach, but take a look at brendanhasz.github.io/2019/03/04/target-encoding.html and search for "mean encoding".
- hatef Alipoor
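For anyone looking for a concrete example of the rank-by-target-mean idea from the comment above, here is a minimal sketch with pandas. The column names (`district`, `price`) and the toy values are made up for illustration, and this is only one way to do it, not necessarily what the linked post does.

```python
import pandas as pd

# Hypothetical toy data: `district` is a category code, `price` is the target.
df = pd.DataFrame({
    "district": [3, 1, 2, 3, 1, 2, 3, 1],
    "price": [300, 250, 100, 320, 230, 120, 310, 240],
})

# Step 1: mean of the target variable for each category.
group_means = df.groupby("district")["price"].mean()

# Step 2: rank the categories by that mean (1 = lowest mean target).
ranks = group_means.rank(method="first").astype(int)

# Step 3: replace each category with its rank.
df["district_rank"] = df["district"].map(ranks)

print(group_means)
print(df)
```

In practice the group means should probably be computed on the training split only and then applied to the test split, otherwise the target leaks into the feature.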
1 Answer
If I understand correctly, you don't need to one-hot encode these features since they are ordinal, i.e. there is meaning in the order. If the numbers were product codes, for example, and there was no sense of 7 being "better than" or "more than" 4, then you would want to one-hot encode those variables. In this case, however, one-hot encoding would discard the ordering information.
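To make the difference concrete, here is a rough sketch that fits both variants on synthetic data. The data (district levels 1 to 7, prices rising linearly with district plus noise) is an assumption made up for this example, not taken from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: every district level 1..7 appears 30 times,
# and price rises linearly with the district level (plus noise).
rng = np.random.default_rng(0)
district = np.tile(np.arange(1, 8), 30).reshape(-1, 1)
price = 50_000 + 20_000 * district.ravel() + rng.normal(0, 5_000, size=len(district))

# Option 1: keep the ordinal feature as a single numeric column.
# The model learns one slope, so the ordering is used directly.
ordinal_model = LinearRegression().fit(district, price)

# Option 2: one-hot encode. The model learns a separate coefficient
# per district and no longer knows that 7 comes after 6.
encoder = OneHotEncoder()
district_onehot = encoder.fit_transform(district)
onehot_model = LinearRegression().fit(district_onehot, price)

print("ordinal coefficients:", ordinal_model.coef_.shape)
print("one-hot coefficients:", onehot_model.coef_.shape)
```

Keeping the integer column assumes that each step up in district changes the price by roughly the same amount. If that does not hold for your data, the one-hot version (or the rank encoding mentioned in the comments) may fit better, at the cost of more columns.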