I'm working with the Python NLTK Wordnet API. I'm trying to find the best synset that represents a group of words.

If I need to find the best synset for something like "school & office supplies", I'm not sure how to go about this. So far I've tried finding the synsets for the individual words and then computing the best lowest common hypernym like this:

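import itertools

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet


def find_best_synset(category_name):
    text = word_tokenize(category_name)
    tags = pos_tag(text)

    # Collect the candidate synsets for every word that has a WordNet POS.
    # get_wordnet_pos (not shown) is assumed to map a Penn Treebank tag to a
    # WordNet POS, returning None for tags it does not cover.
    node_synsets = []
    for word, tag in tags:
        pos = get_wordnet_pos(tag)
        if not pos:
            continue
        node_synsets.append(wordnet.synsets(word, pos=pos))

    max_score = 0
    max_synset = None
    max_combination = None
    # Try every way of picking one synset per word ...
    for combination in itertools.product(*node_synsets):
        # ... and score every pair of synsets within that pick.
        for test in itertools.combinations(combination, 2):
            score = wordnet.path_similarity(test[0], test[1])
            # path_similarity returns None when no path connects the pair.
            if score is not None and score > max_score:
                max_score = score
                max_combination = test
                max_synset = test[0].lowest_common_hypernyms(test[1])
    return max_synset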

However, this doesn't work very well, and it is very costly. Is there a way to figure out which synset best represents multiple words together?

Thanks for your help!

If all your expressions are like that example, then you probably shouldn't look for a common hyperonym. "School supplies" are a kind of supplies, but they aren't some kind of school. Rather, you could consider the synsets of the last word and disambiguate among those using the preceding words (I'm not sure how to do this, however). - lenz
Hmmm, I see your point but I don't think all the expressions are like that example. I realize that "school & office" are the type of supplies but they are still recognized as nouns instead of adjectives. - kevin.w.johnson
Well, it won't simplify your task if the expressions have different internal structures. I suggest you manually assign the correct synset in a random sample (like 20 to begin with) and then look if you can see a pattern. Or manually do even more instances and train a decision tree. - lenz

1 Answer

Apart from what I said in the comments already, I think the way you select the best hyperonym might be flawed. The synset you end up with is not the lowest common hyperonym of all words, but only that of two of them.

Let's stick with your example of "school & office supplies". For each word in the expression you get a number of synsets. So the variable node_synsets will look something like the following:

[[school_1, school_2], [office_1, office_2, office_3], [supply_1]]

In this example, there are 6 ways (2 × 3 × 1) to pick one synset for each word:

[(school_1, office_1, supply_1),
 (school_1, office_2, supply_1),
 (school_1, office_3, supply_1),
 (school_2, office_1, supply_1),
 (school_2, office_2, supply_1),
 (school_2, office_3, supply_1)]

These triples are what you iterate over in the outer for loop (with itertools.product). If the expression has 4 words, you would iterate over quadruples, with 5 it's quintuples, etc.
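
To make the count concrete, here is a small sketch that uses plain strings as stand-ins for the synsets. The names are placeholders taken from the example above, not real WordNet identifiers:

```python
import itertools

# Placeholder strings stand in for real Synset objects.
node_synsets = [
    ["school_1", "school_2"],
    ["office_1", "office_2", "office_3"],
    ["supply_1"],
]

# One element is picked from each inner list, in every possible way.
triples = list(itertools.product(*node_synsets))

print(len(triples))  # 2 * 3 * 1 = 6
print(triples[0])    # ('school_1', 'office_1', 'supply_1')
```

The number of combinations is the product of the list lengths, which is why the search gets expensive quickly for words with many senses.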

Now, with the inner for loop, you pair off each triple. The first one is:

[(school_1, office_1),
 (school_1, supply_1),
 (office_1, supply_1)]

... and you determine the lowest common hyperonym for each pair, keeping only the one from the pair with the highest path similarity. So in the end you might get the lowest common hyperonym of, say, school_2 and office_1, which might be some kind of institution. This is probably not very meaningful, as it doesn't consider any synset of the last word.
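
The pairing step can be reproduced with placeholder strings (again stand-ins, not real synsets), which shows that one of the three pairs leaves out the last word entirely:

```python
import itertools

# Placeholder strings stand in for the first triple of synsets.
triple = ("school_1", "office_1", "supply_1")

# All unordered pairs drawn from the triple.
pairs = list(itertools.combinations(triple, 2))

# Pairs that ignore the last word of the expression.
pairs_without_last_word = [p for p in pairs if "supply_1" not in p]

print(pairs)
print(pairs_without_last_word)
```

If the pair in `pairs_without_last_word` happens to have the highest similarity, the returned hyperonym says nothing about "supplies".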

Maybe you should try to find the lowest common hyperonym of all three words, in each combination of their synsets, and take the one scoring best among them.
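
A minimal sketch of that idea follows. It is one possible reading, not a tested solution: the function names are made up for illustration, and "scoring best" is assumed here to mean "deepest in the hierarchy" (via `max_depth()`), which is my assumption rather than something established above. The code only relies on the `hypernyms()`, `instance_hypernyms()` and `max_depth()` methods of NLTK's `Synset` objects.

```python
import itertools


def all_hypernyms(synset):
    """Return the synset itself plus every ancestor reachable via hypernym links."""
    seen = {synset}
    stack = [synset]
    while stack:
        current = stack.pop()
        for parent in current.hypernyms() + current.instance_hypernyms():
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lowest_common_hypernym(synsets):
    """Return the deepest synset that is an ancestor of (or equal to) every input."""
    if not synsets:
        return None
    common = set.intersection(*(all_hypernyms(s) for s in synsets))
    if not common:
        return None
    # Assumption: "lowest" means the greatest depth below the root.
    return max(common, key=lambda s: s.max_depth())


def best_shared_hypernym(node_synsets):
    """Try every pick of one synset per word; keep the deepest shared hypernym."""
    best, best_depth = None, -1
    for combination in itertools.product(*node_synsets):
        candidate = lowest_common_hypernym(combination)
        if candidate is not None and candidate.max_depth() > best_depth:
            best, best_depth = candidate, candidate.max_depth()
    return best
```

Here `node_synsets` is the list of per-word synset lists built in the question's function. This still iterates over every combination, so it does not address the cost problem, and a very general result (such as the root noun synset) is a sign that the words share little in the hierarchy.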