
I created a table to test my understanding

    F1  F2  Outcome
0   2   5        1
1   4   8        2
2   6   0        3
3   9   8        4
4  10   6        5

From F1 and F2 I tried to predict Outcome.

As you can see, F1 has a strong correlation with Outcome, while F2 is random noise.
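
For reference, here is how the X and Y used below can be built from the table (shown as plain NumPy arrays for simplicity; a pandas DataFrame works just as well):

import numpy as np

# Feature matrix: columns are F1 and F2 from the table above
X = np.array([[ 2, 5],
              [ 4, 8],
              [ 6, 0],
              [ 9, 8],
              [10, 6]])

# Outcome column
Y = np.array([1, 2, 3, 4, 5])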

First I tested PCA:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
fit = pca.fit(X)
print("Explained Variance")
print(fit.explained_variance_ratio_)

# Output
Explained Variance
[ 0.57554896  0.42445104]

This is what I expected, and it shows that F1 is more important.

However, when I run RFE (Recursive Feature Elimination):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=1)
fit = rfe.fit(X, Y)
print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)

# Output
1
[False  True]
[2 1]

It tells me to keep F2 instead. It should tell me to keep F1, since F1 is a strong predictor while F2 is random noise. Why F2?

Thanks


1 Answer


You are using a LogisticRegression model. This is a classifier, not a regressor, so your outcome is treated as class labels (not numbers). For good training and prediction, a classifier needs multiple samples of each class, but in your data only a single row is present for each class. Hence the results are garbage and should not be taken seriously.
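
You can see this directly if you fit the classifier on its own (a quick sketch, using the X and Y from the question):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, Y)

# The outcome values are treated as five distinct class labels,
# each backed by exactly one training sample
print(clf.classes_)       # [1 2 3 4 5]
print(len(clf.classes_))  # 5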

Try replacing it with a regression model and you will see the outcome you expected:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

model = LinearRegression()
rfe = RFE(model, n_features_to_select=1)
fit = rfe.fit(X, Y)

print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)

# Output
1
[ True False]
[1 2]
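
As an extra sanity check (again a sketch with the same X and Y), the coefficients of a plain LinearRegression fit tell the same story without RFE:

from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(X, Y)

# The coefficient on F1 should dominate in magnitude, since F1 tracks
# Outcome almost linearly while F2 is noise
print(lr.coef_)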