I am trying to use a linear SVM and a K Neighbors Classifier for word sense disambiguation (WSD). Here is a segment of the data I am using to train the classifiers:
<corpus lang="English">
<lexelt item="activate.v">
<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with .
</context>
</instance>
<instance id="activate.v.bnc.00044852" docsrc="BNC">
<answer instance="activate.v.bnc.00044852" senseid="38201"/>
<answer instance="activate.v.bnc.00044852" senseid="38202"/>
<context>
For neurophysiologists and neuropsychologists , the way forward in understanding perception has been to correlate these dimensions of experience with , firstly , the material properties of the experienced object or event ( usually regarded as the stimulus ) and , secondly , the patterns of discharges in the sensory system . Qualitative Aspects of Experience The quality or modality of the experience depends less upon the quality of energy reaching the nervous system than upon which parts of the sensory system are <head>activated</head> : stimulation of the retinal receptors causes an experience of light ; stimulation of the receptors in the inner ear gives rise to the experience of sound ; and so on . Muller 's nineteenth - century doctrine of specific energies formalized the ordinary observation that different sense organs are sensitive to different physical properties of the world and that when they are stimulated , sensations specific to those organs are experienced . It was proposed that there are endings ( or receptors ) within the nervous system which are attuned to specific types of energy , For example , retinal receptors in the eye respond to light energy , cochlear endings in the ear to vibrations in the air , and so on .
</context>
</instance>
.....
The difference between the training and test data is that the test data don't have the "answer" tag. I have built a dictionary that stores, for each instance, the words neighboring the "head" word within a window size of 10. When there are multiple "answer" tags for one instance, I only consider the first one. I have also built a set that records the whole vocabulary of the training file so that I can compute a count vector for each instance. For example, if the total vocabulary is [a,b,c,d,e] and one instance has the words [a,a,d,d,e], the resulting vector for that instance would be [2,0,0,2,1]. Here is a segment of the dictionary I built for each word:
{
"activate.v": {
"activate.v.bnc.00024693": {
"instanceId": "activate.v.bnc.00024693",
"senseId": "38201",
"vocab": {
"although": 1,
"back": 1,
"bend": 1,
"bicycl": 1,
"correct": 1,
"dig": 1,
"general": 1,
"handlebar": 1,
"hefti": 1,
"lever": 1,
"nt": 2,
"quit": 1,
"rear": 1,
"spade": 1,
"sprung": 1,
"step": 1,
"type": 1,
"use": 1,
"wo": 1
}
},
"activate.v.bnc.00044852": {
"instanceId": "activate.v.bnc.00044852",
"senseId": "38201",
"vocab": {
"caus": 1,
"ear": 1,
"energi": 1,
"experi": 1,
"inner": 1,
"light": 1,
"nervous": 1,
"part": 1,
"qualiti": 1,
"reach": 1,
"receptor": 2,
"retin": 1,
"sensori": 1,
"stimul": 2,
"system": 2,
"upon": 2
}
},
......
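To make the count-vector idea above concrete, here is a minimal sketch of what I mean. The helper name `count_vector` is made up for this illustration and is not part of my actual code:

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur in this instance.
    return [counts[word] for word in vocabulary]


vocabulary = ["a", "b", "c", "d", "e"]
vector = count_vector(["a", "a", "d", "d", "e"], vocabulary)
print(vector)  # [2, 0, 0, 2, 1]
```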
Now I just need to provide the input to the K Neighbors Classifier and linear SVM from scikit-learn to train the classifiers, but I am not sure how I should build the feature vector and label for each instance. My understanding is that the label should be a tuple of the instance attribute and the senseid attribute from the "answer" tag, but I am not sure about the feature vector. Should I group all the vectors from the same word that share the same instance and senseid in the "answer" tag? There are around 100 words and hundreds of instances for each word, so how am I supposed to deal with that?
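One arrangement that might work, as a sketch under assumptions rather than a confirmed answer: train one classifier per word, use each instance's "vocab" dictionary as one row, and use only the senseId as the label (the instance id would just identify the row). The function `train_per_word` and the third instance below are hypothetical and invented for illustration; the counts are also shortened from my real data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy data in the same shape as my dictionary (counts shortened).
data = {
    "activate.v": {
        "activate.v.bnc.00024693": {
            "senseId": "38201",
            "vocab": {"spade": 1, "lever": 1, "step": 1},
        },
        "activate.v.bnc.00044852": {
            "senseId": "38201",
            "vocab": {"receptor": 2, "stimul": 2, "system": 2},
        },
        # Hypothetical instance, invented so that two senses are present.
        "activate.v.example.1": {
            "senseId": "38202",
            "vocab": {"alarm": 1, "system": 1, "switch": 1},
        },
    }
}


def train_per_word(data, make_classifier):
    """Fit one (vectorizer, classifier) pair per word."""
    models = {}
    for word, instances in data.items():
        rows = [inst["vocab"] for inst in instances.values()]
        labels = [inst["senseId"] for inst in instances.values()]
        # DictVectorizer turns each {feature: count} dict into a sparse row.
        vectorizer = DictVectorizer()
        X = vectorizer.fit_transform(rows)
        classifier = make_classifier()
        classifier.fit(X, labels)
        models[word] = (vectorizer, classifier)
    return models


svm_models = train_per_word(data, LinearSVC)
knn_models = train_per_word(data, lambda: KNeighborsClassifier(n_neighbors=1))

# A test instance has no senseId; words unseen in training are ignored.
test_row = {"spade": 1, "step": 1, "garden": 1}
svm_vectorizer, svm_classifier = svm_models["activate.v"]
knn_vectorizer, knn_classifier = knn_models["activate.v"]
svm_prediction = svm_classifier.predict(svm_vectorizer.transform([test_row]))
knn_prediction = knn_classifier.predict(knn_vectorizer.transform([test_row]))
print(svm_prediction, knn_prediction)
```

With this layout each word only ever sees its own few hundred instances, so the 100 words would simply become 100 small, independent training problems.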
Also, the count vector is only one feature. I need to add more features later on, for example synsets, hypernyms, hyponyms, etc. How am I supposed to do that?
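One option I can imagine, sketched here under assumptions: keep every feature group as a dictionary and merge them under prefixed keys, so that a context word and a hypernym with the same name cannot collide. The merged dictionaries could then presumably go through scikit-learn's DictVectorizer like any other dictionary of counts. The helper `merge_features` and the hypernym name below are hypothetical.

```python
def merge_features(**groups):
    """Merge several {feature: value} dicts, prefixing keys by group name."""
    merged = {}
    for group_name, features in groups.items():
        for key, value in features.items():
            merged[f"{group_name}={key}"] = value
    return merged


context_counts = {"spade": 1, "lever": 1}
# Hypothetical WordNet-derived feature; the name is invented for illustration.
hypernym_counts = {"tool.n.01": 1}

row = merge_features(context=context_counts, hypernym=hypernym_counts)
print(row)
# {'context=spade': 1, 'context=lever': 1, 'hypernym=tool.n.01': 1}
```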
Thanks in advance!