3
votes

Given a set of words V, I would like to group the synonyms in V together. Is there a built-in function in NLTK or WordNet that takes V as input and automatically clusters the words by synonymy?

I already know how to extract the synonyms of each word, but that is not what I am looking for. With that approach, things get complicated when the synonym sets intersect or are subsets/supersets of one another, so I would have to write a function to resolve the conflicts.

As an example, let's consider

V = ["good","constipate","bad","nice","defective","right","respectable","powerful"]

What I want to get as output is:

[('constipate',), ('nice',), ('bad', 'defective'), ('good', 'powerful', 'respectable', 'right')]

Depending on the size or number of clusters, some sets might be split into several sets or merged together. Here, I only care about the words in V and their synonyms within V.

1
If there's no defined number of clusters you want, it's a harder problem. - alvas
@alvas OK, if I set the number of clusters, is there a function that does this clustering? - Mila
Yes, you can use k-means, but first you have to go from word -> synsets -> synset distance -> clusters based on synset-lemma distance, which isn't trivial. It's easier to use word2vec or LDA in gensim, given a large corpus. - alvas
@alvas Thank you for the reply. I did it with word2vec and k-means clustering. I will try synset distance to see how the results differ from word2vec... - Mila
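
The pipeline described in the comments above (word -> synsets -> synset distance -> clusters) could be sketched roughly as follows. This is an illustration under stated assumptions, not a built-in NLTK feature: `wordnet_similarity` and `cluster_words` are hypothetical helper names, and agglomerative (average-linkage) clustering is swapped in for k-means, because WordNet provides pairwise similarities rather than vector coordinates. Note also that path similarity relies on the hypernym hierarchy, so it is mainly informative for nouns and verbs.

```python
from itertools import combinations


def wordnet_similarity(a, b):
    """Highest path similarity between any synset of `a` and any synset of `b`."""
    # Lazy import: only needed when WordNet is actually used (requires the
    # NLTK WordNet corpus to be downloaded).
    from nltk.corpus import wordnet
    scores = [s.path_similarity(t) or 0.0  # path_similarity may return None
              for s in wordnet.synsets(a)
              for t in wordnet.synsets(b)]
    return max(scores, default=0.0)


def cluster_words(words, k, similarity=wordnet_similarity):
    """Hypothetical helper: merge the two most similar clusters until k remain."""
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    # Pre-compute the similarity of every unordered word pair once.
    sim = {frozenset(p): similarity(*p) for p in combinations(words, 2)}
    clusters = [[w] for w in words]

    def linkage(c1, c2):
        # Average similarity over all cross-cluster word pairs.
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > max(k, 1):
        # combinations() yields i < j, so deleting index j keeps i valid.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With WordNet installed, a call would look like `cluster_words(V, 4)`. Any other pairwise similarity function (for example, cosine similarity between word2vec vectors) can be passed as the `similarity` argument instead.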

1 Answer

0
votes

Yes, there is a way to do this using NLTK and WordNet. The following example uses the built-in synsets to look up synonyms for the word 'book':

import nltk
from nltk.corpus import wordnet

synonyms = []

for syn in wordnet.synsets('book'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

The resulting synonyms for 'book' are:

print(synonyms)
>>['book', 'book', 'volume', 'record', 'record_book', 'book', 'script', 'book', 'playscript', 'ledger', 'leger', 'account_book', 'book_of_account', 'book', 'book', 'book', 'rule_book', 'Koran', 'Quran', "al-Qur'an", 'Book', 'Bible', 'Christian_Bible', ..]

The length of synonyms is:

len(synonyms)
>>38

Note: Some of these synonyms come from verb senses, and many entries are simply 'book' repeated for its different senses. Taking the set of synonyms leaves fewer, unique words:

len(set(synonyms))
>>25

The unique synonyms after applying set() are:

{'record', 'Quran', 'Holy_Scripture', 'Koran', 'Good_Book', 'playscript', 'book', 'Word_of_God', 'hold', 'Holy_Writ', 'script', 'leger', 'book_of_account', 'Scripture', 'ledger', 'reserve', 'volume', 'record_book', "al-Qur'an", 'Christian_Bible', 'Word', 'rule_book', 'Bible', 'Book', 'account_book'}
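
To apply the same lookup to the word list V from the question, the per-word synonym sets can be merged into groups. The following is only a sketch, not a built-in NLTK function: `wordnet_synonyms` and `group_synonyms` are hypothetical helper names, and it assumes two words belong in the same group whenever they are linked through a chain of shared synsets (a union-find merge), which also handles intersecting and nested synonym sets. The toy mapping in the usage example is made up for illustration and is not real WordNet output.

```python
from collections import defaultdict


def wordnet_synonyms(word):
    """Return every lemma name that shares a WordNet synset with `word`."""
    # Lazy import so the grouping logic below can be tried without NLTK.
    from nltk.corpus import wordnet
    return {lemma.name()
            for syn in wordnet.synsets(word)
            for lemma in syn.lemmas()}


def group_synonyms(words, synonyms_of=wordnet_synonyms):
    """Hypothetical helper: group words connected by synonym links."""
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    vocab = set(words)
    parent = {w: w for w in words}      # union-find forest

    def find(w):
        # Follow parent links to the root, shortening the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for w in words:
        # Only synonyms that are themselves in the input list matter.
        for other in set(synonyms_of(w)) & vocab:
            parent[find(other)] = find(w)

    groups = defaultdict(list)
    for w in words:
        groups[find(w)].append(w)
    # Smallest groups first, alphabetical within and between equal sizes.
    return sorted((tuple(sorted(g)) for g in groups.values()),
                  key=lambda g: (len(g), g))


# Usage with a made-up synonym mapping (an assumption, not WordNet data):
toy = {"good": {"right", "respectable"}, "bad": {"defective"},
       "powerful": {"good"}}
V = ["good", "constipate", "bad", "nice", "defective", "right",
     "respectable", "powerful"]
print(group_synonyms(V, lambda w: toy.get(w, set())))
# [('constipate',), ('nice',), ('bad', 'defective'), ('good', 'powerful', 'respectable', 'right')]
```

With the WordNet corpus available, `group_synonyms(V)` uses the real synsets instead. Because groups are merged transitively, loosely related words can end up in one large group; restricting the lookup to a single part of speech would be one way to tighten the result.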