To understand what's happening, it helps to understand first what the two feature selection methods are doing.
The information gain of an attribute tells you how much information the attribute gives you with respect to the classification target. That is, it measures the difference in information between the case where you know the value of the attribute and the case where you don't. A common measure of information is Shannon entropy, although any measure that allows you to quantify the information content of a message will do.
So the information gain depends on two things: how much information was available before knowing the attribute value, and how much was available after. For example, if your data contains only one class, you already know what the class is without having seen any attribute values and the information gain will always be 0. If, on the other hand, you have no information to start with (because the classes you want to predict are represented in equal quantities in your data), and an attribute splits the data perfectly into the classes, its information gain will be 1.
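The two edge cases above can be checked directly. Here is a minimal, pure-Python sketch of information gain with Shannon entropy; the function names (`entropy`, `information_gain`) are just illustrative, not from any particular library:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the attribute."""
    n = len(labels)
    split = {}
    for value, label in zip(attribute_values, labels):
        split.setdefault(value, []).append(label)
    remainder = sum(len(subset) / n * entropy(subset) for subset in split.values())
    return entropy(labels) - remainder

# Balanced classes and a perfectly separating attribute: gain is 1 bit.
attr = ["x", "x", "y", "y"]
print(information_gain(attr, ["pos", "pos", "neg", "neg"]))  # 1.0

# Only one class present: no uncertainty to reduce, gain is 0.
print(information_gain(attr, ["pos", "pos", "pos", "pos"]))  # 0.0
```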
The important thing to note in this context is that information gain is a purely information-theoretic measure; it does not consider any actual classification algorithm.
This is what the wrapper method does differently. Instead of analyzing the attributes and targets from an information-theoretic point of view, it uses an actual classification algorithm to build a model with a subset of the attributes and then evaluates the performance of this model. It then tries a different subset of attributes and does the same thing again. The subset for which the trained model exhibits the best empirical performance wins.
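To make the wrapper idea concrete, here is a minimal sketch using only the standard library: a 1-nearest-neighbour classifier evaluated with leave-one-out accuracy over every non-empty feature subset. The function names and the exhaustive search are illustrative assumptions — real wrappers typically use greedy or heuristic search, since the number of subsets grows exponentially:

```python
from itertools import combinations

def one_nn_predict(train_X, train_y, x, features):
    """1-nearest-neighbour prediction using only the given feature indices."""
    def sq_dist(a):
        return sum((a[f] - x[f]) ** 2 for f in features)
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i]))
    return train_y[best]

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of 1-NN restricted to a feature subset."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i+1:], y[:i] + y[i+1:]
        hits += one_nn_predict(train_X, train_y, X[i], features) == y[i]
    return hits / len(X)

def wrapper_select(X, y):
    """Score every non-empty feature subset; return (best score, best subset)."""
    n_features = len(X[0])
    scored = [(loo_accuracy(X, y, subset), subset)
              for k in range(1, n_features + 1)
              for subset in combinations(range(n_features), k)]
    return max(scored)

# Feature 0 separates the classes; feature 1 is noise that hurts 1-NN.
X = [[0.0, 7.2], [0.1, 1.5], [0.9, 6.8], [1.0, 2.0]]
y = ["neg", "neg", "pos", "pos"]
print(wrapper_select(X, y))  # (1.0, (0,)) -- the noisy feature is dropped
```

Note that the winning subset depends on the classifier plugged into the loop, which is exactly why a wrapper can disagree with information gain.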
There are a number of reasons why the two methods can give you different results (this list is not exhaustive):
- A classification algorithm may not be able to leverage all the information that the attributes can provide.
- A classification algorithm may implement its own attribute selection internally (decision tree/forest learners do this, for example), which may use a smaller subset of attributes than standalone attribute selection yields.
- Individual attributes may not be informative, but combinations of them may be (for example, perhaps `a` and `b` have no information separately, but `a*b` might). Attribute selection will not discover this because it evaluates attributes in isolation, while a classification algorithm may be able to leverage it.
- Attribute selection does not consider the attributes sequentially. Decision trees, for example, use a sequence of attributes, and while `b` may provide information on its own, it may not provide any information in addition to `a`, which is used higher up in the tree. Therefore `b` would appear useful when evaluated according to information gain, but is not used by a tree that "knows" `a` first.
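The interaction point above can be demonstrated with a small example. Taking the `a*b` case literally with values in {-1, 1}, each attribute alone has zero information gain, while their product separates the classes perfectly (the `entropy`/`gain` helpers below are an illustrative sketch, not a library API):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    """Information gain of a categorical attribute for the given labels."""
    n = len(labels)
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(v, []).append(label)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

a = [-1, -1, 1, 1]
b = [-1, 1, -1, 1]
cls = ["pos" if x * y > 0 else "neg" for x, y in zip(a, b)]

print(gain(a, cls))                           # 0.0 -- a alone tells you nothing
print(gain(b, cls))                           # 0.0 -- neither does b
print(gain([x * y for x, y in zip(a, b)], cls))  # 1.0 -- a*b separates perfectly
```

A filter scoring `a` and `b` individually would discard both, while a classifier that can form the interaction (or a wrapper evaluating subsets) can still exploit them together.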
In practice it's usually a better idea to use a wrapper for attribute selection, as it takes into account the performance of the actual classifier you want to use, and different classifiers vary widely in how they use the available information. The advantage of classifier-agnostic measures like information gain is that they are much cheaper to compute.