0
votes

I have a word2vec model loaded with model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True), and it contains words in uppercase. How can I produce a new model from this one that contains all of its words, lowercased? Each word should keep the same vector it has in the source model.

1 Answer

2
votes

When you're using a set of pre-trained vectors, like GoogleNews-vectors-negative300.bin.gz, the creator of those vectors determined what words, with what case-handling, are included.

Once loaded, lookup in such a model is by exact, case-sensitive string matching.

There's no built-in capability in Gensim for performing later case-normalization, such as converting all keys to lowercase. And if there were, there would be an open question of how to deal with situations where multiple existing keys would all flatten to the same key.

For example, what if a vector set includes separate vectors for "USA", "Usa", and "usa", but you want a case-insensitive lookup of "usa"? Should just one of the vectors be retained, discarding the others? Should the vector returned be some average of the three? What if there's some odd mixed-casing, say "usA", that's late in the list of all vectors (and thus was relatively infrequent in the training data)? Should that vector have no weight, lesser weight, or equal weight to whatever casing is most frequent?
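To see how often such collisions actually occur in a given vector set, one option is to group the keys by their lowercased form before deciding on a policy. The following is only a minimal sketch: find_case_collisions is a hypothetical helper, not part of Gensim, and the toy key list stands in for the model's real key list (w2v_model.index2entity in older Gensim, index_to_key in Gensim 4.x).

```python
from collections import defaultdict


def find_case_collisions(keys):
    """Group keys by lowercased form; return only groups with several casings.

    Hypothetical helper for illustration; not a Gensim function.
    """
    groups = defaultdict(list)
    for key in keys:
        groups[key.lower()].append(key)
    # Keep only the lowercased forms that more than one original key maps to.
    return {low: variants for low, variants in groups.items() if len(variants) > 1}


# Toy key list standing in for a real model's keys, in their stored order.
sample_keys = ["the", "USA", "The", "Usa", "usa", "cat"]
collisions = find_case_collisions(sample_keys)
```

Each value lists the colliding casings in their original order, so the first entry is the one that appears earliest in the vector set.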

If you know how you'd want to resolve such cases, you could certainly tamper with the model itself to modify its mappings. For example, you could look through the w2v_model.index2entity list, which shows the word in each 'slot' of the model, and modify both that list and the w2v_model.vocab dictionary so that they include only the mappings you'd prefer.
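As one possible way to do this, here is a sketch that builds a separate lowercased model instead of editing the loaded one in place. It rests on several assumptions. The collision policy (keep the first casing encountered, which in a frequency-ordered file such as GoogleNews would usually be the most frequent one) is just one of the choices discussed above. The function names lowercase_keys and build_lowercased_model are hypothetical. The second function assumes Gensim 4.x, where index2entity and vocab were replaced by index_to_key and key_to_index.

```python
def lowercase_keys(keys):
    """Return (new_keys, source_rows) for a 'first casing wins' policy.

    new_keys: lowercased keys, deduplicated, in original order.
    source_rows: for each new key, the row index of the original key it came from.
    """
    new_keys = []
    source_rows = []
    seen = set()
    for row, key in enumerate(keys):
        lowered = key.lower()
        if lowered in seen:
            continue  # an earlier casing already claimed this key
        seen.add(lowered)
        new_keys.append(lowered)
        source_rows.append(row)
    return new_keys, source_rows


def build_lowercased_model(model):
    """Build a new KeyedVectors with lowercased keys (assumes Gensim 4.x)."""
    from gensim.models import KeyedVectors  # imported lazily; needs gensim installed

    new_keys, source_rows = lowercase_keys(model.index_to_key)
    lowered_model = KeyedVectors(vector_size=model.vector_size)
    # Fancy indexing copies the selected rows, so memory use roughly doubles.
    lowered_model.add_vectors(new_keys, model.vectors[source_rows])
    return lowered_model


# Toy demonstration of the key-selection step only.
demo_keys, demo_rows = lowercase_keys(["USA", "the", "Usa", "usa", "The"])
```

With a loaded model, the call would be lowered_model = build_lowercased_model(model), after which lookups such as lowered_model["usa"] use the lowercased keys. Averaging the colliding vectors instead would need a different selection step.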