I would like to reproduce the "Recursive feature elimination with cross-validation" plot with a Decision Tree and kNN in scikit-learn, as documented here

I would like to add this to the classifiers that I am already working with, so that both results are output at the same time. However, it keeps giving me an error.

This is the code that I have modified for a DT:

from collections import defaultdict

import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import zero_one_loss


from scipy.sparse import csr_matrix

lemma2feat = defaultdict(lambda: defaultdict(float))  # { lemma: {feat : weight}}
lemma2cat = dict()
features = set()


with open("input.csv","rb") as infile:
    for line in infile:
        lemma, feature, weight, tClass = line.split()
        lemma2feat[lemma][feature] = float(weight)
        lemma2cat[lemma] = int(tClass)
        features.add(feature)


sorted_rows = sorted(lemma2feat.keys())
col2index = dict()
for colIdx, col in enumerate(sorted(list(features))):
    col2index[col] = colIdx

dMat = np.zeros((len(sorted_rows), len(col2index.keys())), dtype = float)


# populate matrix
for vIdx, vector in enumerate(sorted_rows):
    for feature in lemma2feat[vector].keys():
        dMat[vIdx][col2index[feature]] = lemma2feat[vector][feature]


# sort targ. results.


res = []
for lem in sorted_rows:
    res.append(lemma2cat[lem])


clf = DecisionTreeClassifier(random_state=0)
rfecv = RFECV(estimator=DecisionTreeClassifier, step1, cv=10, 
              scoring='accuracy')
rfecv.fit(dMat)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Number of features selected")
pl.ylabel("Cross validation score (nb of misclassifications)")
pl.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
pl.show()

print "Acc:"
print cross_val_score(clf, dMat, np.asarray(res), cv=10, scoring = "accuracy")

The error begins at line 56, more specifically:

rfecv = RFECV(estimator=DecisionTreeClassifier, step1, cv=10,
SyntaxError: non-keyword arg after keyword arg

Can anyone provide insight on how to correct my code so that this works with at least the DT?

The response below from ogrisel solved the problem with the argument, but it provoked the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda/python.app/Contents/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
    execfile(filename, namespace)
  File "input.py", line 58, in <module>
    rfecv.fit(col_index, rows)
  File "/anaconda/python.app/Contents/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 321, in fit
    X, y = check_arrays(X, y, sparse_format="csr")
  File "/anaconda/python.app/Contents/lib/python2.7/site-packages/sklearn/utils/validation.py", line 211, in check_arrays
    % (size, n_samples))
ValueError: Found array with dim 267. Expected 16

It seems that RFE is reading the input the other way round (my input contains 16 features with 267 targets). How can I pass the dimensions correctly in the code?

Thank you.

1 Answer

SyntaxError: non-keyword arg after keyword arg is quite explicit: you cannot pass a non-keyword argument (e.g. step1) after a keyword argument such as estimator=DecisionTreeClassifier.

So the correct syntax in this case is to drop the estimator= prefix from the first argument:

rfecv = RFECV(DecisionTreeClassifier, step1, cv=10, 
              scoring='accuracy')
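
The ordering rule can be checked without scikit-learn installed, because Python rejects the call while parsing it, before any name is looked up. The sketch below is only an illustration: call_compiles is a hypothetical helper, and the third variant assumes that step1 in the question was meant to be step=1, which the question does not confirm.

```python
def call_compiles(source):
    """Return True if `source` parses as a Python expression, False on SyntaxError."""
    try:
        # Only parsing happens here; RFECV and step1 are never evaluated.
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# Positional argument after a keyword argument: rejected by the parser.
bad = "RFECV(estimator=DecisionTreeClassifier, step1, cv=10)"
# Positional arguments first, keyword arguments last: accepted.
good_positional = "RFECV(DecisionTreeClassifier, step1, cv=10)"
# Every argument named (assumption: step1 was a typo for step=1): accepted.
good_keywords = "RFECV(estimator=DecisionTreeClassifier, step=1, cv=10)"

for src in (bad, good_positional, good_keywords):
    verdict = "parses" if call_compiles(src) else "SyntaxError"
    print("%s -> %s" % (src, verdict))
```

Note that the positional form only parses; it would still need a variable named step1 to exist at run time.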

Now you will get another error: RFECV expects an instance of a model, not a class, as its first argument. To use the default decision tree parameters just use:

rfecv = RFECV(DecisionTreeClassifier(), step1, cv=10, 
              scoring='accuracy')
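
Regarding the ValueError added to the question: fit needs both the data matrix and the targets, with the matrix shaped (n_samples, n_features) and one target per row. In terms of the question's variables that would presumably be rfecv.fit(dMat, np.asarray(res)) rather than rfecv.fit(dMat) or rfecv.fit(col_index, rows). The following is a minimal sketch on synthetic data, not the asker's file: the random matrix, the two-class labels, and step=1 (in place of step1) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the question's data: 267 samples, 16 features.
rng = np.random.RandomState(0)
X = rng.rand(267, 16)                 # shape (n_samples, n_features)
y = (X[:, 0] > 0.5).astype(int)       # one label per sample (two classes, assumed)

# If the matrix was built with features as rows, transpose it so that
# the number of rows matches the number of targets.
if X.shape[0] != len(y):
    X = X.T

# Pass an estimator instance and name every argument.
rfecv = RFECV(estimator=DecisionTreeClassifier(random_state=0),
              step=1, cv=10, scoring="accuracy")
rfecv.fit(X, y)                       # both X and y are required

print("Optimal number of features : %d" % rfecv.n_features_)
print("Selected feature mask: %s" % rfecv.support_)
```

A "Found array with dim 267. Expected 16" message is what one would expect if the first argument had 16 rows and the second had 267 entries, so checking X.shape and len(y) before calling fit is a quick way to spot swapped or transposed inputs.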