16
votes

I need to use gensim to get vector representations of words, and I figure the best thing to use would be a word2vec model that's pre-trained on the English Wikipedia corpus. Does anyone know where to download it, how to install it, and how to use gensim to create the vectors?

2
Have you seen this page before? – imanzabet
This link also might be helpful. – imanzabet

2 Answers

18
votes

You can check WebVectors to find word2vec models trained on various corpora. Each model comes with a readme covering the training details.

You'll have to be a bit careful using these models, though. I'm not sure about all of them, but at least in the Wikipedia model's case, it is not a binary file that you can straightforwardly load using e.g. gensim's functionality, but a txt version, i.e. a file listing each word with its corresponding vector. Keep in mind that the words are suffixed with their part-of-speech (POS) tags. For example, if you'd like to use the model to find similarities for the word vacation, you'll get a KeyError if you type vacation as is, since the model stores this word as vacation_NOUN.

An example snippet of how you could use the wiki model (and perhaps others, if they're in the same format) and its output is below:

import gensim.models

# Path to the plain-text (non-binary) WebVectors model file
model_path = "./WebVectors/3/enwiki_5_ner.txt"

# binary=False because this model ships as a text file, not a binary one
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)

# Note the POS suffixes: the model stores vacation as vacation_NOUN
print(word_vectors.most_similar("vacation_NOUN"))
print(word_vectors.most_similar(positive=['woman_NOUN', 'king_NOUN'], negative=['man_NOUN']))

and the output:

▶ python3 wiki_model.py
[('vacation_VERB', 0.6829521656036377), ('honeymoon_NOUN', 0.6811978816986084), ('holiday_NOUN', 0.6588436365127563), ('vacationer_NOUN', 0.6212040781974792), ('resort_NOUN', 0.5720850825309753), ('trip_NOUN', 0.5585346817970276), ('holiday_VERB', 0.5482848882675171), ('week-end_NOUN', 0.5174300670623779), ('newlywed_NOUN', 0.5146450996398926), ('honeymoon_VERB', 0.5135983228683472)]
[('monarch_NOUN', 0.6679952144622803), ('ruler_NOUN', 0.6257176995277405), ('regnant_NOUN', 0.6217397451400757), ('royal_ADJ', 0.6212111115455627), ('princess_NOUN', 0.6133661866188049), ('queen_NOUN', 0.6015778183937073), ('kingship_NOUN', 0.5986001491546631), ('prince_NOUN', 0.5900266170501709), ('royal_NOUN', 0.5886058807373047), ('throne_NOUN', 0.5855424404144287)]
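
If you'd rather look up bare words, a small wrapper can try the common POS tags until one matches. This is only a sketch: the helper and its default tag list are my own assumptions, not part of gensim, so check the model's readme for the exact tag set it uses.

# Sketch (not part of gensim): resolve a bare word against a POS-tagged
# model by trying common tags in order until one is in the vocabulary.
def most_similar_bare(word_vectors, word, tags=("NOUN", "VERB", "ADJ", "ADV")):
    for tag in tags:
        key = f"{word}_{tag}"
        if key in word_vectors:  # KeyedVectors supports membership tests
            return word_vectors.most_similar(key)
    raise KeyError(f"{word!r} not found under any of the tags {tags}")

print(most_similar_bare(word_vectors, "vacation"))  # resolves to vacation_NOUN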

UPDATE: Here are some useful links to pretrained word embedding models:

fastText models

Google Word2Vec

GloVe: Global Vectors for Word Representation

  • glove.6B.zip: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download). Here's an example in action.
  • glove.840B.300d.zip: Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

WebVectors

  • models trained on various corpora, augmented by part-of-speech (POS) tags
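
Loading these in gensim differs by format; here is a quick sketch. The filenames are the ones the archives ship with, and no_header=True assumes gensim 4.0 or newer (GloVe files lack the word2vec header line):

from gensim.models import KeyedVectors

# The Google News vectors are distributed as a binary word2vec file
google = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# GloVe files are plain text without the word2vec header line;
# gensim >= 4.0 reads them directly via no_header=True
glove = KeyedVectors.load_word2vec_format(
    "glove.6B.300d.txt", binary=False, no_header=True)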
2
votes

@imanzabet provided useful links with pre-trained vectors, but if you want to train the models yourself using gensim, then you need to do two things:

  1. Acquire the Wikipedia data, which you can access here. It looks like the most recent snapshot of English Wikipedia was taken on the 20th, and it can be found here. I believe the other English-language "wikis" (e.g. quotes) are captured separately, so if you want to include them you'll need to download those as well.

  2. Load the data and use it to generate the models. That's a fairly broad question, so I'll just link you to the excellent gensim documentation and word2vec tutorial; a minimal sketch of the pipeline is below.
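
As a concrete starting point, here is a minimal sketch of that pipeline, assuming gensim 4.x. The dump filename is the generic "latest" name, so substitute whichever snapshot you downloaded, and the hyperparameters are just illustrative defaults:

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models import Word2Vec

# Hypothetical path -- use the snapshot you actually downloaded
dump_path = "enwiki-latest-pages-articles.xml.bz2"

# WikiCorpus streams tokenized article text out of the compressed dump;
# passing dictionary={} skips the dictionary build, which isn't needed here
wiki = WikiCorpus(dump_path, dictionary={})

class TokenStream:
    """Restartable iterable: Word2Vec makes several passes over the corpus."""
    def __init__(self, corpus):
        self.corpus = corpus
    def __iter__(self):
        yield from self.corpus.get_texts()

model = Word2Vec(
    sentences=TokenStream(wiki),
    vector_size=300,  # dimensionality of the word vectors
    window=5,
    min_count=5,
    workers=4,
)
model.wv.save_word2vec_format("enwiki_word2vec.txt", binary=False)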

Finally, I'll point out that there seems to be a blog post describing precisely your use case.