6 votes

I am training a neural network which has 10 or so categorical inputs. After one-hot encoding these categorical inputs I end up feeding around 500 inputs into the network.

I would love to be able to ascertain the importance of each of my categorical inputs. Scikit-learn has numerous feature importance algorithms, however can any of these be applied to categorical data inputs? All of the examples use numerical inputs.

I could apply these methods to the one-hot encoded inputs, but how would I extract the meaning after applying to binarised inputs? How does one go about judging feature importance on categorical inputs?

I've successfully used mutual_info_score which supports discrete_features=True. – Igor Raush
@A555h5 seems that it doesn't actually need to be a NumPy array, the list you gave works just fine as an input (although you could use a NumPy array with dtype=np.str which contains strings and it would also work). – Igor Raush
In general, for situations like this, you would use an index encoding where each level of the categorical feature is mapped to an integer 0, 1, etc. Take a look at LabelEncoder in Scikit-learn or categorical series in Pandas. – Igor Raush
In response to your question to Vivek, it depends on what you're trying to accomplish. You can use an importance metric to prune entire features ("feature selection"), or you can one-hot encode them and prune only certain levels ("value selection"). I've seen both ways used. – Igor Raush
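
For illustration, here is a minimal sketch of the index-encoding plus mutual-information approach mentioned in the comments; the DataFrame, column names and target below are made up, not taken from the question.

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import mutual_info_score

    # Hypothetical data: each column is one categorical input, y is the target.
    df = pd.DataFrame({
        "browser": ["chrome", "firefox", "chrome", "safari", "firefox", "chrome"],
        "country": ["us", "de", "us", "fr", "de", "us"],
    })
    y = [1, 0, 1, 0, 0, 1]

    # Index-encode each categorical column (levels mapped to 0, 1, ...)
    # and score it against the target with mutual information.
    for col in df.columns:
        encoded = LabelEncoder().fit_transform(df[col])
        print(col, mutual_info_score(encoded, y))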

1 Answer

2 votes

Using feature selection algorithms on the one-hot encoded features might be misleading because of the relations between the encoded features. For example, if you encode a feature with n values into n binary features and n-1 of them end up among the selected features, the remaining one is redundant: its value is fully determined by the others.

Since the number of your features is quite low (~10), feature selection will probably not help you much: you will likely be able to drop only a few of them without losing too much information.

You wrote that one-hot encoding turns the 10 features into 500, meaning that each feature has about 50 values. In this case you might be more interested in discretisation algorithms, which operate on the values themselves. If there is an implied order on the values, you can use algorithms designed for continuous features. Another option is simply to omit rare values, or values without a strong correlation to the concept.
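
As an illustration of dropping rare values, here is a rough pandas sketch; the column name, toy data and frequency threshold are assumptions for illustration, not anything from the question.

    import pandas as pd

    # Hypothetical raw data; in practice the column holds one of your categorical inputs.
    df = pd.DataFrame({"city": ["london", "paris", "london", "oslo", "london", "paris"]})

    # Collapse levels seen fewer than 2 times into a single "other" level;
    # the threshold is arbitrary and should be tuned for your data.
    counts = df["city"].value_counts()
    rare = counts[counts < 2].index
    df["city"] = df["city"].where(~df["city"].isin(rare), "other")
    print(df["city"].value_counts())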

If you do use feature selection, most algorithms will work on categorical data, but you should beware of corner cases. For example, the mutual information suggested by @Igor Raush is an excellent measure. However, features with many values tend to have higher entropy than features with fewer values. That in turn can inflate the mutual information and bias the selection toward high-cardinality features. A way to cope with this problem is to normalize by dividing the mutual information by the feature entropy.
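
A minimal sketch of that normalization, assuming the feature and the target are already available as discrete (index-encoded) arrays:

    import numpy as np
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score

    def normalized_mutual_information(feature, target):
        # Mutual information between a categorical feature and the target (in nats),
        # divided by the feature's entropy to reduce the bias toward
        # high-cardinality features.
        mi = mutual_info_score(feature, target)
        _, counts = np.unique(feature, return_counts=True)
        h = entropy(counts)  # Shannon entropy of the feature's value distribution
        return mi / h if h > 0 else 0.0

    # Toy example with index-encoded values.
    feature = [0, 0, 1, 1, 2, 2]
    target = [1, 1, 0, 0, 1, 0]
    print(normalized_mutual_information(feature, target))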

Another family of feature selection algorithms that might help you are the wrappers. They delegate the learning to the classification algorithm itself, and are therefore indifferent to the representation as long as the classification algorithm can cope with it.
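
As a rough sketch of the wrapper idea (the toy data, the choice of RandomForestClassifier and the number of features to keep are all assumptions), Scikit-learn's SequentialFeatureSelector wraps any classifier and scores candidate feature subsets by cross-validation:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SequentialFeatureSelector

    # Hypothetical index-encoded categorical features (rows = samples, columns = features).
    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(200, 10))
    y = (X[:, 0] + X[:, 3] > 4).astype(int)   # target depends on features 0 and 3

    # The wrapper repeatedly refits the classifier on candidate feature subsets
    # and keeps the subset with the best cross-validated score.
    selector = SequentialFeatureSelector(
        RandomForestClassifier(n_estimators=50, random_state=0),
        n_features_to_select=3,
    )
    selector.fit(X, y)
    print(selector.get_support())             # boolean mask of the selected features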