Using feature selection algorithms on one-hot encoded features might be misleading due to the relations between the encoded features. For example, if you encode a feature with n values into n binary features and n-1 of them end up among the m selected features, the last one is redundant: it is fully determined by the others.
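As a small illustration (pandas assumed, values made up): dropping one of the one-hot columns loses no information, because it can always be reconstructed from the remaining ones.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="color")

full = pd.get_dummies(colors)                        # n columns for n values
reduced = pd.get_dummies(colors, drop_first=True)    # n-1 columns, same information

print(full)
print(reduced)
```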
Since the number of your features is quite low (~10), feature selection will probably not help you much: you will likely be able to drop only a few of them without losing too much information.
You wrote that one-hot encoding turns the 10 features into 500, meaning that each feature has about 50 values. In that case you might be more interested in discretisation algorithms that operate on the values themselves. If there is an implied order on the values, you can use algorithms designed for continuous features. Another option is simply to omit rare values, or values without a strong correlation to the concept.
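A minimal sketch of the rare-values option (pandas assumed; here rare values are collapsed into a single "other" bucket rather than dropped, and the threshold of 10 is arbitrary):

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int = 10) -> pd.Series:
    """Replace values that occur fewer than min_count times with 'other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other="other")

# Hypothetical usage before one-hot encoding:
# df["city"] = collapse_rare(df["city"], min_count=10)
```

This reduces the number of one-hot columns per feature before any selection takes place.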
In case you do use feature selection, most algorithms will work on categorical data, but you should beware of corner cases. For example, mutual information, suggested by @Igor Raush, is an excellent measure. However, features with many values tend to have higher entropy than features with fewer values, which in turn may lead to higher mutual information and a bias toward features with many values. A way to cope with this problem is to normalize by dividing the mutual information by the feature's entropy.
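A sketch of this normalization (scikit-learn and scipy assumed; both `mutual_info_score` and `scipy.stats.entropy` work in nats, so the ratio is consistent):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def normalized_mutual_info(feature, target):
    """Mutual information of a categorical feature with the target,
    divided by the feature's own entropy to reduce the bias toward
    high-cardinality features."""
    mi = mutual_info_score(feature, target)          # accepts label arrays directly
    _, counts = np.unique(feature, return_counts=True)
    h = entropy(counts)                              # entropy of the feature
    return mi / h if h > 0 else 0.0

# Toy example with made-up values:
x = ["a", "a", "b", "c", "c", "c"]
y = [0, 0, 1, 1, 1, 0]
print(normalized_mutual_info(x, y))
```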
Another family of feature selection algorithms that might help you are the wrappers. They delegate the learning to the classification algorithm itself, and are therefore indifferent to the representation as long as the classification algorithm can cope with it.
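A minimal sketch of a wrapper approach (scikit-learn assumed, toy data made up): sequential forward selection scores candidate feature subsets with the classifier itself, so it works directly on the one-hot encoded matrix.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Toy binary matrix standing in for the one-hot encoded features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30))
y = rng.integers(0, 2, size=200)

selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=5,    # arbitrary target size
    direction="forward",
    cv=3,
)
selector.fit(X, y)
X_selected = selector.transform(X)
print(X_selected.shape)        # (200, 5)
```

Since the search repeatedly retrains the classifier, wrappers can be expensive with 500 columns, but they evaluate feature subsets rather than individual columns, so they handle the redundancy between one-hot columns naturally.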