
I am still very new to machine learning and trying to figure things out myself. I am using scikit-learn and have a data set of tweets with around 20,000 features (n_features=20,000). So far I have achieved precision, recall and F1 scores of around 79%. I would like to use RFECV for feature selection and improve the performance of my model. I have read the scikit-learn documentation but am still a bit confused about how to use RFECV.

This is the code I have so far:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics

# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# feature extraction
count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)

tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)

# Create the RFECV object
nb = MultinomialNB(alpha=0.5)

# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=nb, step=1, cv=2, scoring='accuracy')

rfecv.fit(X_tfidf, y_train)
X_rfecv=rfecv.transform(X_tfidf)

print("Optimal number of features : %d" % rfecv.n_features_)

# train classifier
clf = MultinomialNB(alpha=0.5).fit(X_rfecv, y_train)

# test clf on test data

X_test_CV = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_CV)
X_test_rfecv = rfecv.transform(X_test_tfidf)

y_predicted = clf.predict(X_test_rfecv)

#print the mean accuracy on the given test data and labels

print ("Classifier score is: %s " % rfecv.score(X_test_rfecv,y_test))

Three questions:

1) Is this the correct way to use cross validation and RFECV? I am especially interested to know if I am running any risk of overfitting.

2) The accuracy of my model before and after implementing RFECV with the above code is almost the same (around 78-79%), which puzzles me. I would expect performance to improve by using RFECV. Is there anything I might have missed here or could do differently to improve the performance of my model?

3) What other feature selection methods would you recommend I try? I have tried RFE and SelectKBest so far, but neither has given me any improvement in terms of model accuracy.


1 Answer


To answer your questions:

  1. There is cross-validation built into the RFECV feature selection (hence the name), so you don't really need additional cross-validation for this single step. However, since I understand you are running several tests, it's good to have an overall cross-validation to ensure you're not overfitting to a specific train-test split. I'd like to mention 2 points here:

    1. I doubt the code behaves exactly like you think it does ;).

         # cross validation
         sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
         for train_index, test_index in sss:
             docs_train, docs_test = X[train_index], X[test_index]
             y_train, y_test = y[train_index], y[test_index]
         # feature extraction
         count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))
         X_CV = count_vect.fit_transform(docs_train)
      

    Here we first go through the loop, which has 5 iterations (the n_iter parameter of StratifiedShuffleSplit). Then we exit the loop and run all of your code with only the last values of train_index and test_index. So this is equivalent to a single train-test split, where you probably meant to have 5. You should move your code back into the loop if you want it to run like a 'proper' cross-validation (the first sketch after this list shows one way to do it).

    2. You are worried about overfitting: indeed, when 'looking for the best method' the risk exists that we're going to pick the method that works best... only on the small sample we're testing the method on.
      Here the best practice is to have a first train-test split, then to perform cross-validation using only the train set, as in the sketch after this list. The test set can be used 'sparingly' when you think you have found something, to make sure the scores you get are consistent and you're not overfitting.
      It may look like you're throwing away 30% of your data (your test set), but it's absolutely worth it.
  2. It can be puzzling to see that feature selection does not have that big an impact. To introspect a bit more, you could look at how the score evolves with the number of selected features (see the example from the docs, and the second sketch after this list).
    That being said, I don't think this is the right use case for RFE. Basically, with your code you are eliminating features one by one, which probably takes a long time to run and does not make much sense when you have 20,000 features.

  3. Other feature selection methods: here you mention SelectKBest, but you don't tell us which method you use to score your features! SelectKBest will pick the K best features according to a score function. I'm guessing you were using the default, which is OK, but it's better to have an idea of what the default does ;).

    I would try SelectPercentile with chi2 as a score function. SelectPercentile is probably a bit more convenient than SelectKBest because, if your dataset grows, a percentage probably makes more sense than a hard-coded number of features.
    Another example from the docs that does just that (and more).
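
To make the first point concrete, here is a minimal sketch of that setup: hold out a test set once, then cross-validate on the training portion only, re-fitting the vectorizer inside each fold. It assumes X and y are numpy arrays of raw tweet texts and labels, and it uses the same older scikit-learn module paths as your code (newer releases moved these utilities to sklearn.model_selection).

    import numpy as np
    from sklearn.cross_validation import train_test_split, StratifiedShuffleSplit
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hold out a test set once; touch it only to confirm final results.
    docs_train, docs_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Cross-validate on the training portion only, re-fitting the vectorizer
    # inside each fold so nothing leaks from the validation documents.
    sss = StratifiedShuffleSplit(y_train, 5, test_size=0.2, random_state=42)
    scores = []
    for train_index, val_index in sss:
        fold_train, fold_val = docs_train[train_index], docs_train[val_index]
        y_fold_train, y_fold_val = y_train[train_index], y_train[val_index]

        vectorizer = TfidfVectorizer(stop_words='english', min_df=3,
                                     max_df=0.90, ngram_range=(1, 3))
        X_fold_train = vectorizer.fit_transform(fold_train)
        X_fold_val = vectorizer.transform(fold_val)

        clf = MultinomialNB(alpha=0.5).fit(X_fold_train, y_fold_train)
        scores.append(clf.score(X_fold_val, y_fold_val))

    print("Mean validation accuracy: %0.3f" % np.mean(scores))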
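
For the second point, here is a short sketch of how to look at the score as a function of the number of selected features, assuming rfecv has already been fitted as in your code (it relies on the grid_scores_ attribute of RFECV in this scikit-learn version; recent releases expose cv_results_ instead):

    import matplotlib.pyplot as plt

    plt.figure()
    plt.xlabel("Number of features selected")
    plt.ylabel("Cross-validation score (accuracy)")
    # With step=1, entry i of grid_scores_ corresponds to keeping i + 1 features.
    plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
    plt.show()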

Additional remarks:

  • You could use a TfidfVectorizer instead of a CountVectorizer followed by a TfidfTransformer. This is strictly equivalent.
  • You could use a pipeline object to pack the different steps of your classifier into a single object you can run cross validation on (I encourage you to read the docs, it's pretty useful).

    from sklearn.feature_selection import chi2
    from sklearn.feature_selection import SelectPercentile
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Each step gets a name so its parameters can be reached later (e.g. in a grid search).
    pipeline = Pipeline(steps=[
        ("vectorizer", TfidfVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))),
        ("selector", SelectPercentile(score_func=chi2, percentile=70)),
        ('NB', MultinomialNB(alpha=0.5))
    ])
    

Then you'd be able to run cross-validation on the pipeline object to find the best combination of alpha and percentile, which is much harder to do with separate estimators.
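
For instance, here is a minimal sketch of such a search over alpha and percentile, assuming the pipeline object defined above; GridSearchCV lives in sklearn.grid_search in this older scikit-learn (sklearn.model_selection in newer releases), and parameter names take the form "<step name>__<parameter>".

    from sklearn.grid_search import GridSearchCV

    param_grid = {
        'selector__percentile': [50, 70, 90],  # percentage of features to keep
        'NB__alpha': [0.1, 0.5, 1.0],          # Naive Bayes smoothing
    }

    grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
    grid.fit(docs_train, y_train)

    print("Best parameters: %s" % grid.best_params_)
    print("Best cross-validation accuracy: %0.3f" % grid.best_score_)

grid.best_estimator_ is then a fully fitted pipeline that you can evaluate once on the held-out test set.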

Hope this helps, happy learning ;).