2 votes

I was trying to understand word2vec and decided to give it a go with a German word2vec model. I found deepset's page about their pre-trained models, but I didn't understand how to use (load) the word2vec model. I was expecting a single file, but there are "Vectors" and "Vocab" text files. How can I use these files to load a pre-trained model with gensim (or any other tool)?

UPDATE: I've tried @gojomo's answer below and received this error:

Traceback (most recent call last):
  File "/home/bugra/word2vec_imp/pretrained_models/testtt.py", line 11, in <module>
    binary=False)
  File "/home/bugra/word2vec_imp/project_envv/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1549, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/home/bugra/word2vec_imp/project_envv/lib/python3.7/site-packages/gensim/models/utils_any2vec.py", line 277, in _load_word2vec_format
    vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
  File "/home/bugra/word2vec_imp/project_envv/lib/python3.7/site-packages/gensim/models/utils_any2vec.py", line 277, in <genexpr>
    vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
ValueError: invalid literal for int() with base 10: "b'UNK'"
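
The call that produced this traceback was essentially the one from the answer below (the file names are placeholders for the downloaded deepset files):

from gensim.models import KeyedVectors

# roughly what I ran against the downloaded deepset files
vecs = KeyedVectors.load_word2vec_format('vectors.txt', fvocab='vocab.txt', binary=False)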

So, in the traceback line vocab_size, vector_size = (int(x) for x in header.split()), the header is the first line of the Vectors text file downloaded from deepset's page, and it looks like this:

b'UNK' -0.07903 0.01641 0.006979 -0.035038 0.006474 0.002469 -0.050103 0.142654 -0.03505 0.003106 -0.021312 0.094076 -0.018255 -0.098097 0.087143 0.105799 0.008606 -0.001315 0.069005 0.062015 0.019944 -0.007749 -0.007412 0.050015 -0.083615 0.007712 0.033161 0.017965 -0.06154 -0.017696 0.061967 0.053028 0.038143 -0.07057 0.01561 0.019588 -0.041708 0.034371 -0.066838 -0.059769 0.075711 -0.114826 0.014009 0.050187 -0.01899 -0.076014 -0.052502 0.086082 0.049812 0.008456 -0.01283 0.039918 -0.001924 -0.003752 0.031073 0.034325 0.040086 0.078946 -0.012194 0.056323 0.126129 -0.024503 0.026304 -0.074797 -0.098972 0.003672 0.051386 -0.017574 -0.050253 -0.07677 0.004362 -0.069935 -0.048108 0.020127 0.007066 -0.024247 0.041911 0.03377 -0.011906 -0.0168 -0.00355 -0.003168 0.05164 -0.055769 0.01488 -6e-06 0.094575 -0.066246 -0.111004 -0.031954 0.006958 0.005259 0.15825 0.102919 0.010383 -0.064236 -0.037729 -0.031751 -0.069492 -0.004198 -0.034654 -0.060518 -0.046611 -0.048463 -0.010096 -0.057894 -0.046687 0.062827 0.016907 0.096869 -0.036037 -0.106403 0.056466 0.095621 -0.046383 0.090213 -0.019204 -0.116271 -0.00824 -0.017732 0.037387 -0.021405 -0.040493 -0.059114 0.12289 0.032563 0.103712 0.072411 -0.106944 -0.110485 -0.027564 0.023977 -0.048099 0.036966 -0.11356 -0.009166 0.074402 0.128162 0.080086 0.112749 0.050494 0.064998 0.089217 0.029182 -0.07277 0.058653 0.061047 -0.05293 -0.01979 0.107459 0.002719 -0.008774 -0.098009 0.009321 0.099869 0.024181 -0.071247 -0.054372 0.019997 0.024442 0.108639 0.053727 -0.089804 0.118491 -0.044407 -0.045336 0.078483 0.059462 -0.012287 0.028941 0.064551 0.066738 0.029614 0.092768 0.021783 -0.018141 -0.032692 0.000178 0.021413 0.044657 -0.041903 0.027439 -0.029112 -0.027419 -0.091497 0.00712 -0.076297 -0.097602 -0.098875 -0.067403 -0.015912 0.055845 0.057585 -0.061145 -0.006828 0.044573 0.049632 0.014541 -0.024579 -0.045455 0.095474 -0.02978 -0.060053 -0.005672 -0.002711 0.059481 -0.060563 0.047562 -0.086001 0.064536 0.196527 -0.105742 -0.019043 0.038534 -0.099681 0.031009 -0.020548 -0.058781 0.064247 0.008213 0.126322 0.029859 0.013129 -0.021303 0.043993 0.033347 0.020245 0.037738 -0.02178 0.027693 -0.07024 0.004687 0.045271 -0.022966 0.014069 0.022861 -0.02787 0.082912 -0.049544 0.016079 -0.004684 0.000572 0.077382 0.036401 0.054974 -0.039538 0.002119 0.034002 -0.008836 -0.014758 0.00959 -0.064647 -0.034766 0.016912 -0.036381 -0.037106 0.073451 -0.098941 -0.092281 -0.018656 0.050538 0.041422 0.041235 0.011248 -0.106058 0.066443 0.083865 0.094636 0.004414 -0.092855 -0.027255 0.005234 0.066584 0.055394 0.023019 -0.001949 -0.066794 -0.064739 0.038924 -0.016647 0.000555 0.02428 0.016469 -0.0467 -0.035343 -0.066789 -0.025929 -0.023397 0.062855 0.020142 -0.047568 0.010299 -0.021509 -0.02826 0.029225 0.01803 0.024336 0.018226 -0.009453 -0.068584

Any help would be appreciated.


3 Answers

2 votes

In Gensim, starting from a properly-formatted plain-text file of vectors with associated vocab.txt, you could try:

from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format('vectors.txt', fvocab='vocab.txt', binary=False)

However, per followup discussion below, it appears those files from Deepset aren't currently properly-formatted.
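
If you want to experiment before that happens, one possible workaround is to rewrite the file into the expected format yourself. This is only a rough sketch based on the first line shown in the question: it assumes every line is a b'token' byte-string repr followed by the float values, and that all lines have the same dimensionality.

from gensim.models import KeyedVectors

def repair_word2vec_text(src_path, dst_path):
    rows = []
    with open(src_path, encoding='utf8') as src:
        for line in src:
            token, _, rest = line.rstrip('\n').partition(' ')
            # strip the b'...' wrapper the export left around each token
            if token.startswith("b'") and token.endswith("'"):
                token = token[2:-1]
            rows.append((token, rest))
    vector_size = len(rows[0][1].split())
    with open(dst_path, 'w', encoding='utf8') as dst:
        # word2vec text format starts with a "<vocab_size> <vector_size>" header
        dst.write(f"{len(rows)} {vector_size}\n")
        for token, rest in rows:
            dst.write(f"{token} {rest}\n")

repair_word2vec_text('vectors.txt', 'vectors_fixed.txt')
vecs = KeyedVectors.load_word2vec_format('vectors_fixed.txt', binary=False)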

Ultimately, those should be fixed by Deepset, or you should look to some other source of properly-formatted word-vector files, such as Facebook's Wikipedia-trained fastText vectors for many languages.
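
The fastText .vec files already carry the expected count header, so loading one with gensim is straightforward. For example, assuming you have downloaded the plain-text German file wiki.de.vec from fastText's pre-trained vectors page:

from gensim.models import KeyedVectors

# wiki.de.vec: plain-text German vectors from fastText's pre-trained vectors page
vecs = KeyedVectors.load_word2vec_format('wiki.de.vec', binary=False)
print(vecs.vector_size, len(vecs.vocab))  # dimensionality and vocabulary size (gensim 3.x attributes)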

1 vote

Those are the output files of a word2vec model, not the trained model itself.

They have shared the code for training the embeddings in three different repositories. They use Docker on AWS instances (EC2) to run the training:

https://gitlab.com/deepset-ai/open-source/word2vec-embeddings-de

From the repositories, these are the steps:

  • Download the latest articles from the German Wikipedia
  • Convert them with gensim into plain text: single words separated by spaces, without punctuation
  • Train word2vec and save the resulting embedding files in the mounted OUTPUTDIR

They use Docker to run the training.
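
For reference, the core of the third step boils down to something like this gensim 3.x sketch (corpus.txt is a placeholder for the preprocessed one-sentence-per-line text from the second step; Deepset's actual hyperparameters may differ):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus.txt: one sentence per line, tokens separated by single spaces
sentences = LineSentence('corpus.txt')

# size= is the gensim 3.x parameter name; in gensim 4.x it is vector_size=
model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)

# write the embeddings in the word2vec text format so they can be reloaded
# later with KeyedVectors.load_word2vec_format(..., binary=False)
model.wv.save_word2vec_format('vectors.txt', fvocab='vocab.txt', binary=False)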

0 votes

If you want to use gensim, then the recipe below may help you.

https://github.com/devmount/GermanWordEmbeddings

You can simply download the latest Wikipedia dump data. If you want to clean it and need help, comment below and I will provide the detailed steps for that too.
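
In the meantime, a first rough pass at the cleaning can be done with gensim's WikiCorpus, which extracts plain tokenized text from the dump. A minimal sketch, assuming you have downloaded dewiki-latest-pages-articles.xml.bz2 from dumps.wikimedia.org:

from gensim.corpora import WikiCorpus

# parse the compressed dump and stream out plain, tokenized article text
wiki = WikiCorpus('dewiki-latest-pages-articles.xml.bz2', dictionary={})

with open('dewiki_text.txt', 'w', encoding='utf8') as out:
    for tokens in wiki.get_texts():  # one list of tokens per article
        out.write(' '.join(tokens) + '\n')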