I am classifying input sentences into different categories, such as time, distance, speed, location, etc.
I trained the classifier using `MultinomialNB`. The classifier mainly uses tf (term frequency) as its feature; I also tried taking sentence structure into account by using 1-4 grams.
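For reference, the setup described above can be sketched roughly as follows. This is a minimal sketch, not my exact code: it assumes that "tf" means raw term counts from `CountVectorizer`, the toy training sentences and the helper names `build_classifier` and `ranked_classes` are hypothetical, and `alpha=0.001` is the smoothing value used for the results in this question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; the real dataset has 40 classes.
train_sentences = [
    "what is the cost of a ticket",
    "how much does a phone cost",
    "how long does the trip take",
    "how many hours is the flight",
    "how far is the station",
    "how many meters is the track",
]
train_labels = ["money", "money", "period", "period", "dist", "dist"]


def build_classifier(ngram_range=(1, 1), alpha=0.001):
    """Term-count features fed to MultinomialNB with light smoothing."""
    return make_pipeline(
        CountVectorizer(ngram_range=ngram_range, lowercase=True),
        MultinomialNB(alpha=alpha),
    )


def ranked_classes(model, sentence):
    """Return (label, probability) pairs, most likely class first."""
    probabilities = model.predict_proba([sentence])[0]
    classes = model.steps[-1][1].classes_  # classes of the final NB step
    return sorted(zip(classes, probabilities), key=lambda pair: -pair[1])


# Unigram term counts versus 1-4 grams, the two variants tried here.
unigram_model = build_classifier(ngram_range=(1, 1)).fit(train_sentences, train_labels)
ngram_model = build_classifier(ngram_range=(1, 4)).fit(train_sentences, train_labels)

for label, probability in ranked_classes(unigram_model, "what is the cost of Watch"):
    print(f"{label}: {probability:.2%}")
```

Both variants score a query only by which words (or word sequences) it shares with the training sentences of each class, so an unseen word such as "Watch" contributes nothing to the decision.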
Using `MultinomialNB` with `alpha = 0.001`, these are the results for a few queries:
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} # the two queries above should also be classified as money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is a meter
{"1": {"period": "90.74%"}, "2": {"dist": "9.26%"}} #better result should be distance
Using `MultinomialNB` with n-grams (1-4):
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} # the two queries above should also be classified as money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is an hour
{"1": {"dist": "99.61%"}} #result should be time
So the result depends purely on word occurrence. Is there any way to add word sense disambiguation here (or any other means of bringing in some kind of understanding)?
I already checked Word sense disambiguation in NLTK Python, but the issue here is identifying the main word in the sentence, which differs from sentence to sentence.
I already tried POS tagging (it gives NN, JJ, which the sentence does not rely on) and NER (highly dependent on capitalization, and sometimes it does not disambiguate words like "early" or "cost" in the sentences above); neither of them helps.
**"How long" is sometimes considered time and sometimes distance, so based on the nearby words in the sentence, the classifier should be able to understand which one it is. Similarly, "how fast", "how come", and "how early" ([how + word]) should be understandable.**
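To make the [how + word] idea concrete, here is a minimal sketch of the kind of feature I have in mind. The helper `how_phrase_tokens` and the fused `how_<word>` token format are hypothetical, purely for illustration; such a function could presumably be passed as the `tokenizer` argument of scikit-learn's `CountVectorizer`.

```python
import re

# Hypothetical helper: fuse "how" with the word that follows it into one
# token, so that "how long" and "how fast" become distinct features.
HOW_PATTERN = re.compile(r"\bhow\s+(\w+)", flags=re.IGNORECASE)


def how_phrase_tokens(sentence):
    """Return lowercase word tokens plus fused 'how_<word>' tokens."""
    tokens = re.findall(r"\w+", sentence.lower())
    fused = ["how_" + word.lower() for word in HOW_PATTERN.findall(sentence)]
    return tokens + fused


print(how_phrase_tokens("How long is a meter"))
# ['how', 'long', 'is', 'a', 'meter', 'how_long']
```

This only separates "how long" from "how fast" or "how come". It does not by itself decide whether "how long" means time or distance; that still has to come from the nearby words ("meter" vs. "hour"), which is the part I do not know how to model.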
I am using NLTK, scikit-learn, and Python.
Update:
- 40 classes (each with sentences belonging to that class)
- Total data: 300 KB
Accuracy depends on the query: sometimes it is very good (>90%), and sometimes an irrelevant class is returned. It depends on how well the query matches the dataset.