2 votes

I would like to load pretrained multilingual word embeddings from the fastText library with gensim; here is the link to the embeddings:

https://fasttext.cc/docs/en/crawl-vectors.html

In particular, I would like to load the following word embeddings:

  • cc.de.300.vec (4.4 GB)
  • cc.de.300.bin (7 GB)

Gensim offers the following two options for loading fasttext files:

  1. gensim.models.fasttext.load_facebook_model(path, encoding='utf-8')

    • Load the input-hidden weight matrix from Facebook’s native fasttext .bin output file.
    • load_facebook_model() loads the full model, not just word embeddings, and enables you to continue model training.
  2. gensim.models.fasttext.load_facebook_vectors(path, encoding='utf-8')

    • Load word embeddings from a model saved in Facebook’s native fasttext .bin format.
    • load_facebook_vectors() loads the word embeddings only. Its faster, but does not enable you to continue training.

Source Gensim documentation: https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
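
For reference, a minimal sketch of both calls (assuming gensim 3.8+ and that the downloaded cc.de.300.bin sits in the working directory):

from gensim.models.fasttext import load_facebook_model, load_facebook_vectors

# Option 1: full model (supports continued training; largest memory footprint)
model = load_facebook_model('cc.de.300.bin')

# Option 2: word vectors only (read-only, but lighter)
wv = load_facebook_vectors('cc.de.300.bin')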

Since my laptop has only 8 GB of RAM, I keep getting MemoryErrors, or the loading takes a very long time (up to several minutes).

Is there a more memory-efficient way to load these large models from disk?


1 Answer

5 votes

As vectors will typically take at least as much addressable memory as their on-disk storage, it will be challenging to load fully functional versions of those vectors on a machine with only 8 GB of RAM. In particular:

  • once you start doing the most common operation on such vectors, finding the most_similar() words to a target word/vector, the gensim implementation will also want to cache a set of the word-vectors normalized to unit length, which nearly doubles the required memory (see the sketch after this list)

  • current versions of gensim's FastText support (through at least 3.8.1) also waste a bit of memory on some unnecessary allocations (especially in the full-model case)
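
One mitigation for the normalization overhead, if you only need similarity queries and no further training, is to normalize in place so no second copy is kept. A sketch, assuming wv was loaded via load_facebook_vectors() and that 'hallo' appears in the model (init_sims() is the gensim 3.x API; it is deprecated in 4.x):

# Replace the raw vectors with their unit-length versions in place, so
# most_similar() needs no extra cached copy (original magnitudes are lost,
# which is why further training is then impossible).
wv.init_sims(replace=True)
print(wv.most_similar('hallo', topn=5))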

If you'll only be using the vectors, not doing further training, you'll definitely want to use only the load_facebook_vectors() option.

If you're willing to give up the model's ability to synthesize new vectors for out-of-vocabulary words not seen during training, then you could load just a subset of the full-word vectors from the plain-text .vec file. For example, to load just the first 500K vectors:

from gensim.models.keyedvectors import KeyedVectors

# Keep only the first 500,000 vectors from the plain-text .vec file
wv = KeyedVectors.load_word2vec_format('cc.de.300.vec', limit=500000)

Because such files typically put the most frequent words first, discarding the long tail of low-frequency words often isn't a big loss.
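
For a rough sense of scale (assuming float32 storage, gensim's default): 500,000 vectors × 300 dimensions × 4 bytes is about 0.6 GB for the raw array, plus vocabulary overhead, which fits comfortably in 8 GB of RAM. You can confirm this against the wv loaded above:

# Back-of-the-envelope memory check on the limited load
print(wv.vectors.shape)           # (500000, 300)
print(wv.vectors.nbytes / 1e9)    # ~0.6 (GB for the raw float32 array)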