2 votes

I was trying to understand word2vec and decided to give it a go with a German word2vec model. I found deepset's page about their pre-trained models, but I didn't understand how to use (load) the word2vec model. I was expecting a single file, but there are "Vectors" and "Vocab" text files. How can I use these files to load a pre-trained model with gensim (or any other tool)?

UPDATE: I've tried @gojomo's answer below and received this error:

Traceback (most recent call last):
  File "/home/bugra/word2vec_imp/pretrained_models/testtt.py", line 11, in <module>
    binary=False)
  File "/home/bugra/word2vec_imp/project_envv/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1549, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/home/bugra/word2vec_imp/project_envv/lib/python3.7/site-packages/gensim/models/utils_any2vec.py", line 277, in _load_word2vec_format
    vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
  File "/home/bugra/word2vec_imp/project_envv/lib/python3.7/site-packages/gensim/models/utils_any2vec.py", line 277, in <genexpr>
    vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
ValueError: invalid literal for int() with base 10: "b'UNK'"
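
The call that produced this traceback was essentially the one from the answer below (the file names are placeholders for the downloaded deepset files):

from gensim.models import KeyedVectors

# roughly what I ran against the downloaded deepset files
vecs = KeyedVectors.load_word2vec_format('vectors.txt', fvocab='vocab.txt', binary=False)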

So, in the traceback line vocab_size, vector_size = (int(x) for x in header.split()), the header is the first line of the Vectors text file downloaded from deepset's page, and it looks like this:

b'UNK' -0.07903 0.01641 0.006979 -0.035038 0.006474 0.002469 -0.050103 0.142654 -0.03505 0.003106 -0.021312 0.094076 -0.018255 -0.098097 0.087143 0.105799 0.008606 -0.001315 0.069005 0.062015 0.019944 -0.007749 -0.007412 0.050015 -0.083615 0.007712 0.033161 0.017965 -0.06154 -0.017696 0.061967 0.053028 0.038143 -0.07057 0.01561 0.019588 -0.041708 0.034371 -0.066838 -0.059769 0.075711 -0.114826 0.014009 0.050187 -0.01899 -0.076014 -0.052502 0.086082 0.049812 0.008456 -0.01283 0.039918 -0.001924 -0.003752 0.031073 0.034325 0.040086 0.078946 -0.012194 0.056323 0.126129 -0.024503 0.026304 -0.074797 -0.098972 0.003672 0.051386 -0.017574 -0.050253 -0.07677 0.004362 -0.069935 -0.048108 0.020127 0.007066 -0.024247 0.041911 0.03377 -0.011906 -0.0168 -0.00355 -0.003168 0.05164 -0.055769 0.01488 -6e-06 0.094575 -0.066246 -0.111004 -0.031954 0.006958 0.005259 0.15825 0.102919 0.010383 -0.064236 -0.037729 -0.031751 -0.069492 -0.004198 -0.034654 -0.060518 -0.046611 -0.048463 -0.010096 -0.057894 -0.046687 0.062827 0.016907 0.096869 -0.036037 -0.106403 0.056466 0.095621 -0.046383 0.090213 -0.019204 -0.116271 -0.00824 -0.017732 0.037387 -0.021405 -0.040493 -0.059114 0.12289 0.032563 0.103712 0.072411 -0.106944 -0.110485 -0.027564 0.023977 -0.048099 0.036966 -0.11356 -0.009166 0.074402 0.128162 0.080086 0.112749 0.050494 0.064998 0.089217 0.029182 -0.07277 0.058653 0.061047 -0.05293 -0.01979 0.107459 0.002719 -0.008774 -0.098009 0.009321 0.099869 0.024181 -0.071247 -0.054372 0.019997 0.024442 0.108639 0.053727 -0.089804 0.118491 -0.044407 -0.045336 0.078483 0.059462 -0.012287 0.028941 0.064551 0.066738 0.029614 0.092768 0.021783 -0.018141 -0.032692 0.000178 0.021413 0.044657 -0.041903 0.027439 -0.029112 -0.027419 -0.091497 0.00712 -0.076297 -0.097602 -0.098875 -0.067403 -0.015912 0.055845 0.057585 -0.061145 -0.006828 0.044573 0.049632 0.014541 -0.024579 -0.045455 0.095474 -0.02978 -0.060053 -0.005672 -0.002711 0.059481 -0.060563 0.047562 -0.086001 0.064536 0.196527 -0.105742 -0.019043 0.038534 -0.099681 0.031009 -0.020548 -0.058781 0.064247 0.008213 0.126322 0.029859 0.013129 -0.021303 0.043993 0.033347 0.020245 0.037738 -0.02178 0.027693 -0.07024 0.004687 0.045271 -0.022966 0.014069 0.022861 -0.02787 0.082912 -0.049544 0.016079 -0.004684 0.000572 0.077382 0.036401 0.054974 -0.039538 0.002119 0.034002 -0.008836 -0.014758 0.00959 -0.064647 -0.034766 0.016912 -0.036381 -0.037106 0.073451 -0.098941 -0.092281 -0.018656 0.050538 0.041422 0.041235 0.011248 -0.106058 0.066443 0.083865 0.094636 0.004414 -0.092855 -0.027255 0.005234 0.066584 0.055394 0.023019 -0.001949 -0.066794 -0.064739 0.038924 -0.016647 0.000555 0.02428 0.016469 -0.0467 -0.035343 -0.066789 -0.025929 -0.023397 0.062855 0.020142 -0.047568 0.010299 -0.021509 -0.02826 0.029225 0.01803 0.024336 0.018226 -0.009453 -0.068584

Any help would be appreciated.


3 Answers

2 votes

In Gensim, starting from a properly-formatted plain-text file of vectors with associated vocab.txt, you could try:

from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format('vectors.txt', fvocab='vocab.txt', binary=False)

However, per followup discussion below, it appears those files from Deepset aren't currently properly-formatted.
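
If you want to experiment before that happens, one possible workaround is to rewrite the file into the expected format yourself. This is only a rough sketch based on the first line shown in the question: it assumes every line is a b'token' byte-string repr followed by the float values, and that all lines have the same dimensionality.

from gensim.models import KeyedVectors

def repair_word2vec_text(src_path, dst_path):
    rows = []
    with open(src_path, encoding='utf8') as src:
        for line in src:
            token, _, rest = line.rstrip('\n').partition(' ')
            # strip the b'...' wrapper the export left around each token
            if token.startswith("b'") and token.endswith("'"):
                token = token[2:-1]
            rows.append((token, rest))
    vector_size = len(rows[0][1].split())
    with open(dst_path, 'w', encoding='utf8') as dst:
        # word2vec text format starts with a "<vocab_size> <vector_size>" header
        dst.write(f"{len(rows)} {vector_size}\n")
        for token, rest in rows:
            dst.write(f"{token} {rest}\n")

repair_word2vec_text('vectors.txt', 'vectors_fixed.txt')
vecs = KeyedVectors.load_word2vec_format('vectors_fixed.txt', binary=False)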

Ultimately, those should be fixed by Deepset, or you should look to some other source of properly-formatted word-vector files, such as Facebook's Wikipedia-trained fastText vectors for many languages.
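
The fastText .vec files already carry the expected count header, so loading one with gensim is straightforward. For example, assuming you have downloaded the plain-text German file wiki.de.vec from fastText's pre-trained vectors page:

from gensim.models import KeyedVectors

# wiki.de.vec: plain-text German vectors from fastText's pre-trained vectors page
vecs = KeyedVectors.load_word2vec_format('wiki.de.vec', binary=False)
print(vecs.vector_size, len(vecs.vocab))  # dimensionality and vocabulary size (gensim 3.x attributes)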

1 vote

Those are the output files of a word2vec model, not the trained model itself.

They have shared the code for training the embeddings in three different repositories. They use Docker on AWS instances (EC2) to run the training:

https://gitlab.com/deepset-ai/open-source/word2vec-embeddings-de

From the repositories, these are the steps:

  • Download the latest articles from the German Wikipedia
  • Convert them with gensim into plain text: single words separated by spaces, without punctuation
  • Train word2vec and save the resulting embedding files in the mounted OUTPUTDIR

They use Docker to run the training.
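
For reference, the core of the third step boils down to something like this gensim 3.x sketch (corpus.txt is a placeholder for the preprocessed one-sentence-per-line text from the second step; Deepset's actual hyperparameters may differ):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus.txt: one sentence per line, tokens separated by single spaces
sentences = LineSentence('corpus.txt')

# size= is the gensim 3.x parameter name; in gensim 4.x it is vector_size=
model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)

# write the embeddings in the word2vec text format so they can be reloaded
# later with KeyedVectors.load_word2vec_format(..., binary=False)
model.wv.save_word2vec_format('vectors.txt', fvocab='vocab.txt', binary=False)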

0 votes

If you want to use gensim, then the recipe below may help you.

https://github.com/devmount/GermanWordEmbeddings

You can simply download the latest Wikipedia dump data. If you want to clean it and need help, comment below and I will provide the detailed steps for that too.
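
In the meantime, a first rough pass at the cleaning can be done with gensim's WikiCorpus, which extracts plain tokenized text from the dump. A minimal sketch, assuming you have downloaded dewiki-latest-pages-articles.xml.bz2 from dumps.wikimedia.org:

from gensim.corpora import WikiCorpus

# parse the compressed dump and stream out plain, tokenized article text
wiki = WikiCorpus('dewiki-latest-pages-articles.xml.bz2', dictionary={})

with open('dewiki_text.txt', 'w', encoding='utf8') as out:
    for tokens in wiki.get_texts():  # one list of tokens per article
        out.write(' '.join(tokens) + '\n')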