Word sense disambiguation in classification

Question

I am classifying input sentences into different categories, such as time, distance, speed, location, etc.

I trained the classifier using MultinomialNB.

The classifier mainly uses term frequency (tf) as its feature. I also tried taking sentence structure into account by using 1-4 grams.
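For readers who want to see what such features look like, here is a minimal plain-Python sketch of tf counts over 1-4 grams. It is an illustration only, not the actual training code: `ngram_counts` is a hypothetical helper, and with scikit-learn the same idea would presumably be expressed as `CountVectorizer(ngram_range=(1, 4))` feeding `MultinomialNB`.

```python
from collections import Counter


def ngram_counts(sentence, max_n=4):
    """Count every word n-gram of length 1..max_n in a sentence (term frequency)."""
    tokens = sentence.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = ngram_counts("How long is a meter")
print(sorted(features))
```

Note that the bigram "how long" is produced identically for "How long is a meter" and "How long is an hour", so only the n-grams containing "meter" or "hour" can separate the two queries.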

Using MultinomialNB with alpha = 0.001, these are the results for a few queries:

what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} # for the two queries above, the result should also be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}}  #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is a meter
{"1": {"period": "90.74%"}, "2": {"dist": "9.26%"}}  #better result should be distance

Using MultinomialNB with n-grams (1-4):

what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}}   # for the two queries above, the result should also be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}}  #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is an hour
{"1": {"dist": "99.61%"}}   #result should be time

So the result depends purely on word occurrence. Is there any way to add word sense disambiguation here (or any other means of bringing in some kind of understanding)?

I already checked "Word sense disambiguation in NLTK Python",

but the issue here is identifying the main word in the sentence, which differs from sentence to sentence.

I have already tried POS tagging (it gives tags such as NN and JJ, on which the sentence category does not rely) and NER (highly dependent on capitalization; sometimes NER also fails to disambiguate words like "early" and "cost" in the sentences above). Neither of them helps.

**"How long" is sometimes considered time and sometimes distance. So, based on the nearby words in the sentence, the classifier should be able to understand which one it is. Similarly, "how fast", "how come", "how early" [how + word] should be understandable.**

I am using NLTK, scikit-learn, and Python.

Update:

40 classes (each with sentences belonging to that class)
Total data: 300 KB

Accuracy depends on the query. Sometimes it is very good (>90%); sometimes an irrelevant class is returned. It depends on how well the query matches the dataset.

What kind of understanding do you want to have? what did you want to achieve with word disambiguation? — user823743
When you used the NB classifier, how many classes did you have and what were they? How large is your dataset? What was the accuracy that you achieved? I'm asking this because based on the information you have given so far, the solution to your problem seems to be unsupervised learning to me. — user823743

tripleee · Accepted Answer · 2014-12-13T10:40:37

Attempting to deduce semantics purely by looking at individual words out of context is not going to take you very far. In your "watch" examples, the only term which actually indicates that you have "money" semantics is the one you hope to disambiguate. What other information is there in the sentence to help you reach that conclusion, as a human reader? How would you model that knowledge? (A traditional answer would reason about your perception of watches as valuable objects, or something like that.)

Having said that, you might want to look at WordNet synsets as a possibly useful abstraction. At least then you could say that "cost", "price", and "value" are related somehow. I suppose the word-level statistics you have already calculated show that they are not fully synonymous, and the variation you see basically accounts for that fact (though your input size sounds kind of small for adequately covering variances of usage patterns for individual word forms).
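As a rough illustration of that abstraction, the sketch below appends a shared concept token to any query containing one of the related words, so a word-count classifier sees something common to all three. The `CONCEPTS` table and the `MONETARY_VALUE` label are hand-written assumptions standing in for a real WordNet lookup; whether WordNet actually groups these three words this way would need to be checked.

```python
# Hypothetical stand-in for a WordNet lookup: each word form maps to a
# shared concept label. A real version might derive these from synsets.
CONCEPTS = {
    "cost": "MONETARY_VALUE",
    "price": "MONETARY_VALUE",
    "value": "MONETARY_VALUE",
}


def add_concept_tokens(sentence):
    """Return the lowercased tokens plus a concept token for each known word."""
    tokens = sentence.lower().split()
    concepts = [CONCEPTS[t] for t in tokens if t in CONCEPTS]
    return tokens + concepts


for query in ("what is the value of Watch",
              "what is the price of Watch",
              "what is the cost of Watch"):
    print(add_concept_tokens(query))
```

All three queries now share the extra `MONETARY_VALUE` token, while the original word forms are kept so that their individual statistics are not lost.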

Another hint could be provided by part of speech annotation. If you know that "value" is used as a noun, that (to my mind, at least) narrows the meaning to "money talk", whereas the verb reading is much less specifically money-oriented ("we value your input", etc). In your other examples, it is harder to see whether it would help at all. Perhaps you could perform a quick experiment with POS-annotated input and see whether it makes a useful difference. (But then POS is not always possible to deduce correctly, for much the same reasons you are having problems now.)
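A quick way to try that experiment is to fuse each word with its tag, so that the noun and verb readings become distinct features. This is only a sketch: `pos_features` is a hypothetical helper, and the tags below are written by hand as an assumption about what a tagger (for example `nltk.pos_tag`) might return.

```python
def pos_features(tagged_tokens):
    """Turn (word, tag) pairs into 'word/TAG' strings usable as classifier tokens."""
    return [f"{word.lower()}/{tag}" for word, tag in tagged_tokens]


# Hand-tagged examples (assumption: a real tagger would produce something similar).
noun_reading = [("what", "WP"), ("is", "VBZ"), ("the", "DT"),
                ("value", "NN"), ("of", "IN"), ("Watch", "NN")]
verb_reading = [("we", "PRP"), ("value", "VBP"), ("your", "PRP$"), ("input", "NN")]

noun_feats = pos_features(noun_reading)
verb_feats = pos_features(verb_reading)
print(noun_feats)
print(verb_feats)
```

With this encoding, "value" as a noun and "value" as a verb no longer contribute to the same count, which is the distinction described above.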

The sentences you show as examples are all rather simple. It would not be very hard to write a restricted parser for a small subset of English where you could actually start to try to make some sense of the input grammatically, if you know that your input will generally be constrained to simple questions with no modal auxiliaries etc.
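To give an idea of how small such a parser could start out, here is a sketch covering just two question shapes from the examples ("what is the X of Y" and "how X ..."). The patterns, the function name `parse_question`, and the output fields are all hypothetical; a usable version would need many more patterns and real grammatical analysis.

```python
import re

# Hypothetical patterns for two question shapes seen in the examples.
WHAT_ATTR = re.compile(
    r"^what\s+is\s+the\s+(?P<attr>\w+)\s+of\s+(?P<obj>.+)$", re.IGNORECASE)
HOW_WORD = re.compile(
    r"^how\s+(?P<modifier>\w+)\s+(?P<rest>.+)$", re.IGNORECASE)


def parse_question(sentence):
    """Return a small dict describing the question, or None if no pattern matches."""
    text = sentence.strip().rstrip("?")
    m = WHAT_ATTR.match(text)
    if m:
        return {"type": "what_attr",
                "attribute": m.group("attr").lower(),
                "object": m.group("obj")}
    m = HOW_WORD.match(text)
    if m:
        return {"type": "how_word",
                "modifier": m.group("modifier").lower(),
                "rest": m.group("rest")}
    return None


print(parse_question("How long is an hour"))
print(parse_question("what is the cost of Watch"))
```

Once the question is split up like this, "How long is a meter" and "How long is an hour" differ only in the `rest` field, which is where a rule or classifier could look for the unit noun that decides between distance and time.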

(Incidentally, I'm not sure "how come can I go to Mumbai" is "manner", if it is grammatical at all. Strictly speaking, you should have subordinate clause word order here. I would understand it to mean roughly "Why is it that I can go to Mumbai?")
