I have a question about fasttext (https://fasttext.cc/). I want to download a pre-trained model and use it to retrieve the word vectors from text.

After downloading the pre-trained model (https://fasttext.cc/docs/en/english-vectors.html) I unzipped it and got a .vec file. How do I import this into fasttext?

I've tried to use the load_vectors function given on that page as follows:

import fasttext
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

vectors = load_vectors('/Users/username/Downloads/wiki-news-300d-1M.vec')
model = fasttext.load_model(vectors)

However, I can't run this code to completion because Python crashes. How can I successfully load these pre-trained word vectors?

Thank you for your help.

Please edit your question to specify whether there is an error message. - ygorg
How big is the vector file? How much RAM does your machine have? - dennlinger

1 Answer

FastText's advantage over word2vec or GloVe, for example, is that it uses subword information, so it can return vectors for OOV (out-of-vocabulary) words.

That is why two types of pretrained models are offered: .vec and .bin.

.vec is a text file that is essentially a dictionary Dict[word, vector]: the word vectors are pre-computed for the words in the training vocabulary.

.bin is a binary fastText model that can be loaded using fasttext.load_model('file.bin') and that can provide word vectors for unseen (OOV) words, be trained further, etc.

In your case you are loading a .vec file, so vectors is already the "final form" of the data. fasttext.load_model expects the path to a .bin file, not a dictionary, so the last line of your code cannot work.
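As a rough sketch of what "final form" means here, the loader from the question can be used on its own, without calling fasttext.load_model at all. The version below is a lightly adapted variant, not the official function: it wraps map in list so each vector is a real list under Python 3, and it adds a hypothetical limit parameter (not in the original) to cap how many words are read. The tiny sample file is made up for illustration; in practice you would pass the path to your own .vec file.

```python
import io
import os
import tempfile

def load_vectors(fname, limit=None):
    """Read a fastText .vec text file into a dict of word -> list of floats."""
    data = {}
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # The first line holds the vocabulary size and the vector dimension.
        n, d = map(int, fin.readline().split())
        for i, line in enumerate(fin):
            # `limit` is an illustrative addition to keep memory use down.
            if limit is not None and i >= limit:
                break
            tokens = line.rstrip().split(' ')
            # list(...) materializes the vector; a bare map() is lazy in Python 3.
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data, d

# Made-up two-word sample in .vec layout, standing in for wiki-news-300d-1M.vec.
sample = "2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile('w', suffix='.vec', delete=False,
                                 encoding='utf-8', newline='\n') as tmp:
    tmp.write(sample)
    sample_path = tmp.name

vectors, dim = load_vectors(sample_path)
os.remove(sample_path)

# Look vectors up directly in the dictionary; no fasttext model is involved.
print(dim, vectors['hello'])
```

Since the full file holds one million 300-dimensional vectors, storing them as Python lists of floats takes a lot of memory, which may be related to the crash; reading only the first words via limit is one way to test that.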

If you need more than a Python dictionary, you can use gensim.models.KeyedVectors (which handles any word vectors, such as word2vec, GloVe, etc.).
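A minimal sketch of the gensim route, assuming gensim is installed (pip install gensim). The .vec file uses the word2vec text layout, so KeyedVectors.load_word2vec_format with binary=False should be able to read it. The helper name load_vec_with_gensim is hypothetical and only there for illustration; the import sits inside the function so the snippet can be pasted even where gensim is missing.

```python
def load_vec_with_gensim(path, limit=None):
    """Load a fastText .vec file into gensim KeyedVectors (hypothetical helper)."""
    # Imported here because gensim is an optional third-party dependency.
    from gensim.models import KeyedVectors
    # binary=False: .vec is plain text; limit reads only the first `limit` words.
    return KeyedVectors.load_word2vec_format(path, binary=False, limit=limit)

# Example usage with the path from the question (not executed here):
# kv = load_vec_with_gensim('/Users/username/Downloads/wiki-news-300d-1M.vec', limit=200000)
# kv['king']                        # vector for a word in the vocabulary
# kv.most_similar('king', topn=5)   # nearest neighbours by cosine similarity
```

Note that vectors loaded this way still cover only the words listed in the file; for OOV words you would need the .bin model.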