2 votes

I would like to load pretrained multilingual word embeddings from the fastText library with gensim; here is the link to the embeddings:

https://fasttext.cc/docs/en/crawl-vectors.html

In particular, I would like to load the following word embeddings:

  • cc.de.300.vec (4.4 GB)
  • cc.de.300.bin (7 GB)

Gensim offers the following two options for loading fasttext files:

  1. gensim.models.fasttext.load_facebook_model(path, encoding='utf-8')

    • Load the input-hidden weight matrix from Facebook’s native fasttext .bin output file.
    • load_facebook_model() loads the full model, not just word embeddings, and enables you to continue model training.
  2. gensim.models.fasttext.load_facebook_vectors(path, encoding='utf-8')

    • Load word embeddings from a model saved in Facebook’s native fasttext .bin format.
    • load_facebook_vectors() loads the word embeddings only. Its faster, but does not enable you to continue training.

Source Gensim documentation: https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
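
For reference, a minimal sketch of both calls (assuming gensim 3.8+ and that the downloaded cc.de.300.bin sits in the working directory):

from gensim.models.fasttext import load_facebook_model, load_facebook_vectors

# Option 1: full model (supports continued training; largest memory footprint)
model = load_facebook_model('cc.de.300.bin')

# Option 2: word vectors only (read-only, but lighter)
wv = load_facebook_vectors('cc.de.300.bin')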

Since my laptop has only 8 GB of RAM, I keep getting MemoryErrors, or the loading takes a very long time (up to several minutes).

Is there a more memory-efficient way to load these large models from disk?


1 Answer

5 votes

As vectors will typically take at least as much addressable memory as their on-disk storage, it will be challenging to load fully functional versions of those vectors on a machine with only 8 GB of RAM. In particular:

  • once you start doing the most common operation on such vectors, finding the most_similar() words to a target word/vector, the gensim implementation will also want to cache a set of the word-vectors normalized to unit length, which nearly doubles the required memory (see the sketch after this list)

  • current versions of gensim's FastText support (through at least 3.8.1) also waste a bit of memory on some unnecessary allocations (especially in the full-model case)
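
One mitigation for the normalization overhead, if you only need similarity queries and no further training, is to normalize in place so no second copy is kept. A sketch, assuming wv was loaded via load_facebook_vectors() and that 'hallo' appears in the model (init_sims() is the gensim 3.x API; it is deprecated in 4.x):

# Replace the raw vectors with their unit-length versions in place, so
# most_similar() needs no extra cached copy (original magnitudes are lost,
# which is why further training is then impossible).
wv.init_sims(replace=True)
print(wv.most_similar('hallo', topn=5))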

If you'll only be using the vectors, not doing further training, you'll definitely want to use only the load_facebook_vectors() option.

If you're willing to give up the model's ability to synthesize new vectors for out-of-vocabulary words not seen during training, then you could load just a subset of the full-word vectors from the plain-text .vec file. For example, to load just the first 500K vectors:

from gensim.models.keyedvectors import KeyedVectors

# Keep only the first 500,000 vectors from the plain-text .vec file
wv = KeyedVectors.load_word2vec_format('cc.de.300.vec', limit=500000)

Because such files typically put the most frequent words first, discarding the long tail of low-frequency words often isn't a big loss.
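
For a rough sense of scale (assuming float32 storage, gensim's default): 500,000 vectors × 300 dimensions × 4 bytes is about 0.6 GB for the raw array, plus vocabulary overhead, which fits comfortably in 8 GB of RAM. You can confirm this against the wv loaded above:

# Back-of-the-envelope memory check on the limited load
print(wv.vectors.shape)           # (500000, 300)
print(wv.vectors.nbytes / 1e9)    # ~0.6 (GB for the raw float32 array)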