I am using scikit-learn for question classification. I have this code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

print(features[0], '\n')  # raw question text before vectorization
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features = vectorizer.fit_transform(features)
print(features[0], '\n')  # first row of the sparse tf-idf matrix
selector = SelectPercentile(f_classif, percentile=100)
selector.fit(features, labels)
features = selector.transform(features).toarray()
print(features[0])
print(len(features[0]), '\n')
which produces the following result:
how serfdom develop leav russia ?
(0, 5270) 0.499265751002
(0, 3555) 0.473352969263
(0, 1856) 0.449852125968
(0, 5433) 0.569476725713
[ 0. 0. 0. ..., 0. 0. 0.]
6743
My first question is: what does the matrix returned by TfidfVectorizer.fit_transform mean? The scikit-learn documentation says:
Learn vocabulary and idf, return term-document matrix. This is equivalent to fit followed by transform, but more efficiently implemented.
Wikipedia says a term-document matrix:
shows which documents contain which terms and how many times they appear.
The Wikipedia example of such a matrix is straightforward, but the value printed above looks totally different.
Next, SelectPercentile should keep only the highest-scoring features, depending on the given percentile:
Reduce X to the selected features.
Why do I get 6743 features? :D
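For reference, this is how I understand percentile selection to behave (toy random data, my own example, not my real features):

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.RandomState(0)
X = rng.rand(20, 10)           # 20 samples, 10 features
y = rng.randint(0, 2, 20)      # binary labels

# percentile=100 keeps every feature; percentile=50 keeps the top half
sel100 = SelectPercentile(f_classif, percentile=100).fit(X, y)
sel50 = SelectPercentile(f_classif, percentile=50).fit(X, y)

print(sel100.transform(X).shape)  # all 10 features kept
print(sel50.transform(X).shape)   # top 5 features kept
```

So my guess is that with percentile=100 nothing is filtered out, and 6743 is just the full vocabulary size from the vectorizer, but I would like this confirmed.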
P.S.: The program seems to work with 89% accuracy.
EDIT: I am new to Python and machine learning, so please explain it like I'm five.