0
votes

Classifiers in scikit-learn or NLTK generally consider only term frequency or TF-IDF.

I want to consider sentence structure as well as term frequency for classification. I have 15 categories of questions, each with a text file containing one sentence per line.

The category "city" contains this sentence:

In which city Obama was born?

If I consider only term frequency, the following sentences might not be matched, because "Obama" or "city" in the dataset does not match the words in the query sentence:

1. In which place Hally was born
2. In which city Hally was born?

Is there any classifier that considers both term frequency and sentence structure, so that once trained it also classifies input queries with a similar sentence structure?

2 Answers

2
votes

You could compute the tf-idf on n-grams in addition to the unigrams. In scikit-learn you can specify the ngram_range that will be taken into account: if you set it to include up to 3-grams, you end up storing the frequency of word combinations such as "In which place", which is quite indicative of the type of question being asked.
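A minimal sketch of this idea, assuming a recent scikit-learn. The toy sentences, the labels, and the choice of LogisticRegression are illustrative assumptions and not part of the question; the real training data would come from the 15 category files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the per-category text files.
train_sentences = [
    "In which city Obama was born?",
    "In which city is the Eiffel Tower located?",
    "In which year Obama was born?",
    "In which year did the war end?",
]
train_labels = ["city", "city", "year", "year"]

# ngram_range=(1, 3) builds tf-idf features from unigrams, bigrams and trigrams.
# The token pattern keeps one-letter words, which the default pattern drops.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"\b\w+\b")

# Any classifier that accepts sparse input could be used here instead.
model = make_pipeline(vectorizer, LogisticRegression())
model.fit(train_sentences, train_labels)

# Word sequences such as "in which city" are now features of their own.
has_trigram = "in which city" in vectorizer.vocabulary_
print(has_trigram)

# The query shares those n-grams with the training data even though
# the name "Hally" never appeared in it.
predicted = model.predict(["In which city Hally was born?"])[0]
print(predicted)
```

With only a handful of sentences the prediction is not meaningful in itself; the point is that phrases like "in which city" become features alongside the single words.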

1
votes

As drekyn said, you can use scikit-learn for feature extraction. Here is an example:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> # Extract unigrams and bigrams; this token pattern also keeps one-letter words
>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
...                                     token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
...     ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True

Source
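To relate this to the sentences in the question, here is a small sketch (my own illustration, not taken from the scikit-learn documentation) that lists the n-grams shared by the training sentence and the query. The ngram_range of (1, 3) is an assumed setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer can be built without fitting the vectorizer first.
analyze = CountVectorizer(ngram_range=(1, 3),
                          token_pattern=r"\b\w+\b").build_analyzer()

known = set(analyze("In which city Obama was born?"))
query = set(analyze("In which city Hally was born?"))

# N-grams present in both sentences; these carry the shared structure.
shared = sorted(known & query)
print(shared)
```

The names differ, but the two sentences still overlap on multi-word features such as "in which city" and "was born", which is what lets a classifier trained on n-grams pick up on similar sentence structure.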