
I have a word embedding file, as shown below (click here to see the complete file on GitHub). I would like to know the procedure for generating word embeddings so that I can generate embeddings for my own dataset.

in -0.051625 -0.063918 -0.132715 -0.122302 -0.265347 
to 0.052796 0.076153 0.014475 0.096910 -0.045046 
for 0.051237 -0.102637 0.049363 0.096058 -0.010658 
of 0.073245 -0.061590 -0.079189 -0.095731 -0.026899 
the -0.063727 -0.070157 -0.014622 -0.022271 -0.078383 
on -0.035222 0.008236 -0.044824 0.075308 0.076621 
and 0.038209 0.012271 0.063058 0.042883 -0.124830 
a -0.060385 -0.018999 -0.034195 -0.086732 -0.025636 
The 0.007047 -0.091152 -0.042944 -0.068369 -0.072737 
after -0.015879 0.062852 0.015722 0.061325 -0.099242 
as 0.009263 0.037517 0.028697 -0.010072 -0.013621 
Google -0.028538 0.055254 -0.005006 -0.052552 -0.045671 
New 0.002533 0.063183 0.070852 0.042174 0.077393 
with 0.087201 -0.038249 -0.041059 0.086816 0.068579 
at 0.082778 0.043505 -0.087001 0.044570 0.037580 
over 0.022163 -0.033666 0.039190 0.053745 -0.035787 
new 0.043216 0.015423 -0.062604 0.080569 -0.048067 

2 Answers


I was able to convert each word in a dictionary to the above format by following these steps:

  1. Initially, represent each word in the dictionary by a unique integer.
  2. Take each integer one by one, wrap it as array([[integer]]), and feed it as the input array in the code below.
  3. The word corresponding to that integer and its output vector can then be stored in a JSON file (I used output_array.tolist() to make the vector JSON-serializable).
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

vocab_size = 10000      # set this to your dictionary size
embedding_dim = 5       # size of each embedding vector
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=1))
model.compile('rmsprop', 'mse')

integer = 42                                # each integer is fed one by one using a loop
input_array = np.array([[integer]])         # shape (1, 1): one word per prediction
output_array = model.predict(input_array)   # shape (1, 1, embedding_dim)
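
For step 3, here is a minimal sketch of the storing loop, continuing from the model above and assuming a dictionary word_to_int that maps each word to its unique integer (both names are illustrative):

import json

embeddings = {}
for word, integer in word_to_int.items():
    input_array = np.array([[integer]])
    output_array = model.predict(input_array)
    # tolist() makes the numpy vector JSON-serializable, as in step 3
    embeddings[word] = output_array[0][0].tolist()

with open("embeddings.json", "w") as f:
    json.dump(embeddings, f)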

Reference

How does Keras 'Embedding' layer work?


It is important to understand that there are multiple ways to generate word embeddings. The popular word2vec, for example, can generate them using either CBOW or skip-grams.
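
For instance, here is a minimal sketch of training skip-gram embeddings on your own sentences with the gensim library (an assumption; the toy corpus and parameter values are illustrative, and the parameter names follow gensim 4.x):

from gensim.models import Word2Vec

# Each sentence is a list of tokens; replace with your own dataset.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# sg=1 selects skip-gram (sg=0 would select CBOW); vector_size=5
# matches the 5-dimensional vectors in the question's file.
model = Word2Vec(sentences, vector_size=5, window=2, min_count=1, sg=1)

print(model.wv["cat"])   # the learned 5-dimensional embedding for "cat"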

Hence, there are multiple "procedures" for generating word embeddings. One of the easier methods to understand (albeit with its drawbacks) uses Singular Value Decomposition (SVD). The steps are briefly described below, followed by a short sketch.

  1. Create a term-document matrix, i.e. terms as rows and the documents they appear in as columns.
  2. Perform SVD.
  3. Truncate the output vector for each term to n dimensions. In your example above, n = 5.
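
A minimal sketch of those three steps using only numpy, on a toy three-document corpus (the corpus and n = 2 are illustrative; your data would use n = 5):

import numpy as np

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Step 1: build the term-document matrix (terms as rows, documents as columns)
vocab = sorted({w for d in docs for w in d.split()})
term_doc = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        term_doc[vocab.index(w), j] += 1

# Step 2: perform SVD
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)

# Step 3: truncate each term's vector to n dimensions
n = 2
embeddings = U[:, :n] * S[:n]   # scale left singular vectors by singular values

for word, vec in zip(vocab, embeddings):
    print(word, vec)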

You can have a look at this link for a more detailed description of using word2vec's skip-gram model to generate an embedding: Word2Vec Tutorial - The Skip-Gram Model.

For more information on SVD, you can look at this and this.