
What I want is to cluster words and phrases, e.g. knitting, knit loom, loom knitting, weaving loom, rainbow loom, home decoration accessories, loom knit, knitting loom, ... I don't have a corpus, only the words and phrases themselves. Could I use a pre-trained model, such as one trained on GoogleNews or Wikipedia, to do this?

I am now trying to use Gensim to load the GoogleNews pre-trained model and compute phrase similarity. I've been told that the GoogleNews model includes vectors for both phrases and words. However, I can only get word similarity; phrase similarity fails with an error saying that the phrase is not in the vocabulary. Please advise. Thank you.

from gensim.models import KeyedVectors

GOOGLE_MODEL = '../GoogleNews-vectors-negative300.bin'

# Load the pre-trained vectors (binary word2vec format)
model = KeyedVectors.load_word2vec_format(GOOGLE_MODEL, binary=True)

# Works: a single word
model.most_similar("computer", topn=3)

# Fails with an error: "computer_software" is not in the vocabulary
model.most_similar("computer_software", topn=3)

1 Answer


The GoogleNews set does include many multi-word phrases, created via some statistical analysis, but it might not include a specific phrase you're hoping for, like 'computer_software'.

On the other hand, I see an online word-list suggesting that a phrase like 'composite_fillings' is in the GoogleNews vocabulary, so this will likely work for you:

model.most_similar("composite_fillings", topn=3) 
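Before calling most_similar on a whole list of phrases, it may help to check which ones the vector set actually contains. The following is a minimal sketch, not part of the original code: the helper names `to_token` and `split_by_vocab` and the toy vocabulary are hypothetical, and it assumes phrases are stored with underscores between words, as in 'composite_fillings'.

```python
def to_token(phrase):
    """Join a multi-word phrase with underscores, e.g. 'knit loom' -> 'knit_loom'."""
    return "_".join(phrase.split())


def split_by_vocab(phrases, vocab):
    """Return (found, missing) lists of underscore-joined tokens.

    `vocab` can be any container that supports the `in` operator.
    """
    found, missing = [], []
    for phrase in phrases:
        token = to_token(phrase)
        if token in vocab:
            found.append(token)
        else:
            missing.append(token)
    return found, missing


# Toy stand-in for the real vocabulary (an assumption, for illustration only).
toy_vocab = {"knitting", "loom", "composite_fillings"}

found, missing = split_by_vocab(
    ["knitting", "composite fillings", "knit loom"], toy_vocab
)
print("in vocabulary:", found)
print("not in vocabulary:", missing)
```

A loaded gensim KeyedVectors object supports the same `token in model` membership test, so it should be possible to pass `model` in place of `toy_vocab`. The lookup is an exact string match, so casing and spelling must agree with the stored token.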

With that vector set, you're limited to the phrases its creators chose to model. If you need similarly strong vectors for other phrases, you'd likely need to train your own model on a corpus in which the phrases important to you have been combined into single tokens. (If you just need something better than nothing, averaging the word-vectors of a phrase's constituent words would give you something to work with, but that's a pretty crude stand-in for truly modeling the bigram/multigram against its unique contexts.)
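The averaging fallback mentioned above could look roughly like this. It is only a sketch under stated assumptions: `phrase_vector` and `cosine` are hypothetical helper names, the three-dimensional `toy_vectors` stand in for real 300-dimensional word-vectors, and words missing from the vector set are simply skipped.

```python
import math


def phrase_vector(phrase, vectors):
    """Average the vectors of the words in `phrase` that `vectors` knows.

    `vectors` is any mapping from word to a sequence of floats.
    Returns None when none of the words are known.
    """
    words = [w for w in phrase.split() if w in vectors]
    if not words:
        return None
    dim = len(vectors[words[0]])
    return [
        sum(float(vectors[w][i]) for w in words) / len(words)
        for i in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy vectors (an assumption, for illustration only).
toy_vectors = {
    "knitting": [1.0, 0.0, 0.0],
    "loom": [0.0, 1.0, 0.0],
    "computer": [0.0, 0.0, 1.0],
}

a = phrase_vector("knitting loom", toy_vectors)
b = phrase_vector("loom knitting", toy_vectors)
c = phrase_vector("computer", toy_vectors)

print(cosine(a, b))  # same words in a different order: averaging ignores word order
print(cosine(a, c))  # no shared direction in this toy data
```

Because the helpers only use `word in vectors` and `vectors[word]`, a loaded KeyedVectors model should work in place of `toy_vectors`. Note the design limitation this makes visible: 'knitting loom' and 'loom knitting' get identical vectors, which is part of why averaging is a crude stand-in for a trained phrase vector.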