
I am using scikit-learn for question classification. I have this code:

# features: a list of question strings, labels: their classes (loaded elsewhere)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

print(features[0], '\n')  # raw text of the first question

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features = vectorizer.fit_transform(features)

print(features[0], '\n')  # sparse tf-idf row for the first question

selector = SelectPercentile(f_classif, percentile=100)
selector.fit(features, labels)
features = selector.transform(features).toarray()

print(features[0])        # same row, now dense
print(len(features[0]), '\n')

which produces the following result:

how serfdom develop leav russia ?

(0, 5270)   0.499265751002
(0, 3555)   0.473352969263
(0, 1856)   0.449852125968
(0, 5433)   0.569476725713

[ 0.  0.  0. ...,  0.  0.  0.]
6743

My first question is: what does the matrix returned by `TfidfVectorizer` mean? The sklearn documentation for `fit_transform` says:

Learn vocabulary and idf, return term-document matrix. This is equivalent to fit followed by transform, but more efficiently implemented.

From wikipedia:

shows which documents contain which terms and how many times they appear.

The Wikipedia example of a term-document matrix is straightforward, but the value I get back looks completely different.

Next, the SelectPercentile function should return the most important features, depending on the given percentile:

Reduce X to the selected features.

Why do I get 6743 features? :D

P.S.: The program seems to work with 89% accuracy.

EDIT: I am new to Python and machine learning, so please explain it like I'm five.

Comment (D.W.): Cross-posted: stats.stackexchange.com/q/245928/2921, stackoverflow.com/q/40595936/781723. Please do not post the same question on multiple sites. Each community should have an honest shot at answering without anybody's time being wasted.

1 Answer


Computers work on numbers (the only language they understand), so to process or analyse text we first need a way to convert it into numbers. Tf-idf (term frequency - inverse document frequency) is one such method.

"Term frequency" (Tf) measures the importance of a word in a document by how often it appears in that document. But raw frequency alone is misleading: very common words (like "the") appear in almost every document yet carry little meaning, while semantically important words can be rare. To correct for this, we use "Inverse Document Frequency" (Idf), which downweights words that occur in many documents.

For more detail, the following link explains Tf-Idf thoroughly:

https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/
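On the 6743-features question: SelectPercentile(f_classif, percentile=100) keeps 100% of the columns, i.e. it filters nothing out, so 6743 is simply the size of the vocabulary your vectorizer learned. A small sketch with random toy data (my own, just to show the shapes):

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)
X = rng.random((20, 50))           # 20 samples, 50 features
y = rng.integers(0, 2, size=20)    # binary labels

# percentile=100 keeps every feature: still 50 columns
keep_all = SelectPercentile(f_classif, percentile=100).fit_transform(X, y)
print(keep_all.shape)

# percentile=10 keeps only the top-scoring 10% of features
keep_top = SelectPercentile(f_classif, percentile=10).fit_transform(X, y)
print(keep_top.shape)
```

With your question data, lowering `percentile` below 100 is what would actually reduce the 6743 columns.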