Using a Word2Vec model pre-trained on Wikipedia

Question

I need to use gensim to get vector representations of words, and I figure the best thing to use would be a word2vec model that's pre-trained on the English Wikipedia corpus. Does anyone know where to download it, how to install it, and how to use gensim to create the vectors?

formi23 · Accepted Answer · 2017-12-07T02:39:49

You can check WebVectors to find Word2Vec models trained on various corpora. Each model comes with a readme covering the training details. You'll have to be a bit careful using these models, though. I'm not sure about all of them, but at least in Wikipedia's case, the model is not a binary file but a txt version, i.e. a file of words and their corresponding vectors. gensim can still load it, as long as you pass binary=False.

Keep in mind, though, that each word is suffixed with its part-of-speech (POS) tag. For example, if you'd like to use the model to find similarities for the word vacation, you'll get a KeyError if you type vacation as is, since the model stores this word as vacation_NOUN.

Below is an example snippet showing how you could use the wiki model (and perhaps others, if they're in the same format), followed by its output:

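import gensim.models

# Path to the text-format model from WebVectors
model_path = "./WebVectors/3/enwiki_5_ner.txt"

# binary=False because the file is plain text (words and their vectors)
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)
# Words must carry their POS tag: vacation_NOUN, not vacation
print(word_vectors.most_similar("vacation_NOUN"))
# Analogy query: king - man + woman
print(word_vectors.most_similar(positive=['woman_NOUN', 'king_NOUN'], negative=['man_NOUN']))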

And the output:

▶ python3 wiki_model.py
[('vacation_VERB', 0.6829521656036377), ('honeymoon_NOUN', 0.6811978816986084), ('holiday_NOUN', 0.6588436365127563), ('vacationer_NOUN', 0.6212040781974792), ('resort_NOUN', 0.5720850825309753), ('trip_NOUN', 0.5585346817970276), ('holiday_VERB', 0.5482848882675171), ('week-end_NOUN', 0.5174300670623779), ('newlywed_NOUN', 0.5146450996398926), ('honeymoon_VERB', 0.5135983228683472)]
[('monarch_NOUN', 0.6679952144622803), ('ruler_NOUN', 0.6257176995277405), ('regnant_NOUN', 0.6217397451400757), ('royal_ADJ', 0.6212111115455627), ('princess_NOUN', 0.6133661866188049), ('queen_NOUN', 0.6015778183937073), ('kingship_NOUN', 0.5986001491546631), ('prince_NOUN', 0.5900266170501709), ('royal_NOUN', 0.5886058807373047), ('throne_NOUN', 0.5855424404144287)]
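
Since a bare word raises a KeyError, it can help to check which tagged forms of a word the model actually contains before querying it. The following is only a minimal sketch: tagged_forms is a hypothetical helper (not part of gensim), the tag list is an assumption that should be checked against the model's readme, and a plain set stands in for the loaded model so the snippet runs on its own.

```python
# Hypothetical helper (not part of gensim): find which POS-tagged forms of a
# bare word exist in a vocabulary whose keys look like "vacation_NOUN".
# The tag list is an assumption; check the model's readme for its real tag set.
DEFAULT_TAGS = ("NOUN", "VERB", "ADJ", "ADV", "PROPN")


def tagged_forms(word, vocab, tags=DEFAULT_TAGS):
    """Return every "word_TAG" key present in vocab, in tag order."""
    candidates = (f"{word}_{tag}" for tag in tags)
    return [key for key in candidates if key in vocab]


# A plain set stands in for the loaded model here. A gensim KeyedVectors
# object also supports the `in` operator, so it can be passed as `vocab`.
toy_vocab = {"vacation_NOUN", "vacation_VERB", "holiday_NOUN", "royal_ADJ"}

print(tagged_forms("vacation", toy_vocab))  # ['vacation_NOUN', 'vacation_VERB']
print(tagged_forms("banana", toy_vocab))    # []
```

With the real model, something like tagged_forms("vacation", word_vectors) should return the keys you can then pass to most_similar.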

UPDATE: Here are some useful links to other pretrained models, in binary and text formats:

Pretrained word embedding models:

fastText models:

crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens).
wiki-news-300d-1M.vec.zip: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
wiki-news-300d-1M-subword.vec.zip: 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
Wiki word vectors, dim=300: wiki.en.zip: bin+text model

Google Word2Vec

Pretrained word/phrase vectors:
- GoogleNews-vectors-negative300.bin.gz
- GoogleNews-vectors-negative300-SLIM.bin.gz: slim version with approx. 300k words
Pretrained entity vectors:
- freebase-vectors-skipgram1000.bin.gz: Entity vectors trained on 100B words from various news articles
- freebase-vectors-skipgram1000-en.bin.gz: Entity vectors trained on 100B words from various news articles, using the deprecated /en/ naming (more easily readable); the vectors are sorted by frequency
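
As far as I know, the fastText .vec files and the Google .bin.gz files can be read with the same load_word2vec_format call used for the wiki model, with only the binary flag changing. The sketch below guesses that flag from the file name. looks_binary and load_vectors are hypothetical helpers, not gensim functions, and the name-based guess is an assumption rather than a rule. gensim is imported inside load_vectors so the name check can be tried without it.

```python
def looks_binary(path):
    """Guess the `binary` flag from the file name (.bin, .bin.gz, .bin.bz2)."""
    name = path.lower()
    # Strip one compression suffix, if present, before checking for ".bin".
    for suffix in (".gz", ".bz2"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name.endswith(".bin")


def load_vectors(path, limit=None):
    """Load word2vec-format vectors; `limit` keeps only the first N words."""
    # Imported here so the rest of the sketch runs without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(
        path, binary=looks_binary(path), limit=limit
    )


# Caveats (assumptions to verify): .zip archives need to be extracted first,
# and the .bin inside wiki.en.zip is fastText's own format, which to my
# knowledge needs gensim.models.fasttext.load_facebook_vectors instead.
for name in (
    "GoogleNews-vectors-negative300.bin.gz",
    "wiki-news-300d-1M.vec",
    "enwiki_5_ner.txt",
):
    print(name, looks_binary(name))
```

Passing a limit (for example load_vectors(path, limit=500000)) is one way to keep memory use down with the larger files, at the cost of dropping the rarer words.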

GloVe: Global Vectors for Word Representation

glove.6B.zip: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download). Here's an example in action.
glove.840B.300d.zip: Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
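
One thing to watch with the GloVe files: they are plain text like the wiki model above, but they lack the "number of words, vector size" header line that the word2vec text format starts with, so load_word2vec_format will not read them as they are. A minimal sketch of the fix is to write a copy with that header prepended. add_word2vec_header is a hypothetical helper and the toy file stands in for a real GloVe download. gensim also ships a glove2word2vec script for this, and I believe gensim 4.x accepts no_header=True in load_word2vec_format, which would make the conversion unnecessary.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prefixed with "<rows> <dims>"."""
    rows = 0
    dims = 0
    # First pass: count the rows and read the vector size off the first line.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if rows == 0:
                dims = len(line.split()) - 1  # first token is the word
            rows += 1
    # Second pass: write the header, then copy every line unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{rows} {dims}\n")
        for line in src:
            dst.write(line)
    return rows, dims


# Tiny stand-in for a real GloVe file (two words, three dimensions).
tmp_dir = tempfile.mkdtemp()
glove_file = os.path.join(tmp_dir, "toy_glove.txt")
w2v_file = os.path.join(tmp_dir, "toy_w2v.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("king 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")

shape = add_word2vec_header(glove_file, w2v_file)
print(shape)  # (2, 3)
```

The converted file can then be loaded with binary=False, the same way as the wiki model.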

WebVectors

Models trained on various corpora, with words suffixed by their part-of-speech (POS) tags