I am currently using 'InfoGainAttributeEval' for feature selection. I want to know what happens in that method. I found the following.

Evaluates the worth of an attribute by measuring the information gain with respect to the class.

InfoGain(Class,Attribute) = H(Class) - H(Class | Attribute).

As I am new to this area I don't understand what it is. Can someone please explain to me how it works? :) And how does it differ from 'GainRatioAttributeEval'?

1 Answer

InfoGainAttributeEval (and GainRatioAttributeEval) are both used for feature selection tasks.

What InfoGainAttributeEval basically does is measure how much each feature contributes to decreasing the overall entropy.

Let's take an example. Say we have this dataset:

temperature | wind | class
------------|------|----------
high        | low  | play
low         | low  | play
high        | low  | play
low         | high | cancelled
low         | low  | play
high        | high | cancelled
high        | low  | play

The entropy, H(X), is defined as follows:

H(X) = -sum(Pi*log2(Pi))

where Pi is the probability of class i in the dataset and log2 is the base-2 logarithm (Weka's code works with natural logarithms internally but, as far as I can tell, rescales the result to base 2, so the scores it reports are still in bits). Entropy basically measures the degree of "impurity": the closer it is to 0, the less impurity there is in your dataset.
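As a rough illustration of the formula, here is a minimal Python sketch. The `entropy` helper is a name made up for this example, not something from Weka. It uses the equivalent form sum(Pi*log2(1/Pi)), which gives the same value as -sum(Pi*log2(Pi)).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1/p) is the same quantity as -p * log2(p);
    # with p = n / total, 1/p is total / n.
    return sum((n / total) * log2(total / n) for n in counts.values())


print(entropy(["play"] * 4))               # pure set: 0.0
print(entropy(["play", "cancelled"] * 2))  # evenly mixed set: 1.0
```

A set containing a single class has no impurity at all, while a 50/50 mix of two classes gives the maximum of 1 bit.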

Hence, a good attribute is one that carries the most information, i.e. one that reduces the entropy the most. The InfoGainAttributeEval method of Weka is a way of evaluating exactly this.

Now, the entropy of our example is: H(Class) = -(5/7*log2(5/7) + 2/7*log2(2/7)) = 0.863.

Let's calculate for our example the amount of information carried by the temperature attribute.

InfoGain(Class,Temperature) = H(Class) - H(Class | Temperature).

To get the H(Class | Temperature), we need to split the dataset according to this attribute.

                            Dataset
                             /   \
                            /     \
                           / Temp° \
                          /         \
                         /           \
                        /             \
                      high           low



temperature | wind | class                 temperature | wind | class
high        | low  | play                  low         | low  | play
high        | low  | play                  low         | high | cancelled
high        | high | cancelled             low         | low  | play
high        | low  | play

Each branch here has its own entropy. We need to first calculate the entropy of each split.

H(left_split) = -(3/4*log2(3/4) + 1/4*log2(1/4)) = 0.811
H(right_split) = -(1/3*log2(1/3) + 2/3*log2(2/3)) = 0.918

H(Class | Temperature) is then equal to the sum of both children's entropies, each weighted by the proportion of the parent dataset's instances that ended up in that child. In short:

H(Class | Temperature) = 4/7*H(left_split) + 3/7*H(right_split).

You then have everything you need to calculate the InfoGain. Here, H(Class | Temperature) = 4/7*0.811 + 3/7*0.918 = 0.857, so InfoGain(Class,Temperature) = 0.863 - 0.857 = 0.006 bits. This means that the temperature feature only reduces the global entropy by 0.006 bits: its contribution to reducing the entropy (= the information gain) is very small.
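If you want to check the arithmetic yourself, the steps above can be sketched in a few lines of Python. This is only an illustration of the formula, not Weka's actual code; `ROWS`, `entropy` and `info_gain` are names I made up for the example.

```python
from collections import Counter, defaultdict
from math import log2

# The example dataset as (temperature, wind, class) rows.
ROWS = [
    ("high", "low", "play"),
    ("low", "low", "play"),
    ("high", "low", "play"),
    ("low", "high", "cancelled"),
    ("low", "low", "play"),
    ("high", "high", "cancelled"),
    ("high", "low", "play"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def info_gain(rows, attr_index, class_index=-1):
    """H(Class) - H(Class | Attribute) for the attribute in column attr_index."""
    classes = [row[class_index] for row in rows]
    # Split the class labels into one branch per attribute value.
    branches = defaultdict(list)
    for row in rows:
        branches[row[attr_index]].append(row[class_index])
    # Weight each branch's entropy by its share of the parent dataset.
    conditional = sum(
        len(branch) / len(rows) * entropy(branch) for branch in branches.values()
    )
    return entropy(classes) - conditional


print(f"InfoGain(Class, Temperature) = {info_gain(ROWS, 0):.3f} bits")
```

Calling `info_gain(ROWS, 1)` does the same for the wind attribute.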

This is pretty obvious when you look at the instances in the dataset: at first glance we can see that the temperature doesn't affect the final class much, unlike the wind feature.
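To make the comparison concrete, here are the same steps for wind. All five low-wind rows are "play" and both high-wind rows are "cancelled", so each branch is pure and has an entropy of 0:

```latex
\begin{aligned}
H(\text{Class} \mid \text{Wind})
  &= \tfrac{5}{7}\, H(\text{low}) + \tfrac{2}{7}\, H(\text{high})
   = \tfrac{5}{7} \cdot 0 + \tfrac{2}{7} \cdot 0 = 0 \\
\text{InfoGain}(\text{Class}, \text{Wind})
  &= H(\text{Class}) - H(\text{Class} \mid \text{Wind})
   = 0.863 - 0 = 0.863 \text{ bits}
\end{aligned}
```

In other words, knowing the wind removes all of the uncertainty about the class in this toy dataset.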

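As for GainRatioAttributeEval, it's an enhancement of InfoGain with a normalized score: the information gain is divided by the entropy of the attribute itself, H(Attribute). This compensates for InfoGain's tendency to favour attributes with many distinct values.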
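Here is a minimal sketch of that normalization, assuming the definition given in Weka's documentation, GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute). Again, `ROWS`, `entropy` and `gain_ratio` are names made up for the example, and returning 0 when H(Attribute) is 0 is my own choice for the sketch, not necessarily what Weka does.

```python
from collections import Counter
from math import log2

# The example dataset as (temperature, wind, class) rows.
ROWS = [
    ("high", "low", "play"),
    ("low", "low", "play"),
    ("high", "low", "play"),
    ("low", "high", "cancelled"),
    ("low", "low", "play"),
    ("high", "high", "cancelled"),
    ("high", "low", "play"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def gain_ratio(rows, attr_index, class_index=-1):
    """Information gain divided by the entropy of the attribute itself."""
    attr_values = [row[attr_index] for row in rows]
    classes = [row[class_index] for row in rows]
    # H(Class | Attribute): weighted entropy of each branch.
    conditional = 0.0
    for value in set(attr_values):
        branch = [c for a, c in zip(attr_values, classes) if a == value]
        conditional += len(branch) / len(rows) * entropy(branch)
    gain = entropy(classes) - conditional
    split_entropy = entropy(attr_values)  # H(Attribute)
    # Assumption of this sketch: an attribute with a single value scores 0.
    return gain / split_entropy if split_entropy > 0 else 0.0


for name, index in (("temperature", 0), ("wind", 1)):
    print(f"GainRatio(Class, {name}) = {gain_ratio(ROWS, index):.3f}")
```

On this dataset wind comes out at 1.0, because it separates the two classes perfectly and its own entropy happens to equal H(Class).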

Hope that helps!

Source for part of this answer : Anuj Sharma and Shubhamoy Dey. Article: Performance Investigation of Feature Selection Methods and Sentiment Lexicons for Sentiment Analysis. IJCA Special Issue on Advanced Computing and Communication Technologies for HPC Applications ACCTHPCA(3):15-20, July 2012.