2
votes

Suppose I have the dataframe shown below:

| Text |
| Storm in RI worse than last hurricane |
| Green Line derailment in Chicago |
| MEG issues Hazardous Weather Outlook |

I created a word2vec model using the code below:

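import gensim

def sent_to_words(sentences):
    # Tokenize, lowercase, and strip accents from each sentence.
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# Word2Vec iterates over the corpus several times, so pass a list, not a one-shot generator.
text_data = list(sent_to_words(df['Text']))
# size/iter are the gensim 3.x names; gensim 4 renamed them to vector_size/epochs.
w2v_model = gensim.models.Word2Vec(text_data, size=100, min_count=1, window=5, iter=50)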

Now, how do I convert the text in the 'Text' column to vectors using this word2vec model?


2 Answers

0
votes

You can get the trained word embeddings with:

w2v_model.wv

You can get the embedding of a specific word with:

w2v_model.wv['word']
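
Note that this lookup raises a KeyError for a word the model never saw during training, so it helps to check membership first. A minimal sketch, using a plain dict of made-up 3-dimensional vectors as a stand-in for w2v_model.wv (which supports the same `in` and `[]` operations); get_vector_or_none is a hypothetical helper name, not part of gensim:

```python
import numpy as np

# Toy stand-in for w2v_model.wv; the values are made up for illustration.
toy_vectors = {
    "storm": np.array([0.1, 0.2, 0.3]),
    "hurricane": np.array([0.3, 0.2, 0.1]),
}

def get_vector_or_none(vectors, word):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    return vectors[word] if word in vectors else None

print(get_vector_or_none(toy_vectors, "storm"))    # the stored toy vector
print(get_vector_or_none(toy_vectors, "tornado"))  # out of vocabulary, prints None
```

With a real model, the same call would presumably be get_vector_or_none(w2v_model.wv, 'word').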
0
votes

Word2Vec models can only map words to vectors, so, as @metalrt mentioned, you have to aggregate the word vectors of a sentence into a single sentence vector. A good baseline is to compute the element-wise mean of the word vectors:

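import numpy as np

# axis=0 averages across words, giving one vector per sentence instead of a single scalar.
df["Text"].apply(lambda text: np.mean([w2v_model.wv[word] for word in text.lower().split() if word in w2v_model.wv], axis=0))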

The example above tokenizes naively on whitespace. You can also use the spaCy library for better tokenization:

import spacy
nlp = spacy.load("en_core_web_sm")

# nlp(text) returns a Doc of tokens; lowercase them to match the vocabulary built by simple_preprocess.
df["Text"].apply(lambda text: np.mean([w2v_model.wv[token.lower_] for token in nlp(text) if not token.is_punct and token.lower_ in w2v_model.wv], axis=0))
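
One caveat with mean pooling: a sentence with no in-vocabulary words hands np.mean an empty list, which yields nan with a warning. Below is a minimal sketch of a helper that falls back to a zero vector in that case. It assumes a dict of made-up vectors in place of w2v_model.wv; sentence_vector and dim are hypothetical names, and with a real model dim would be w2v_model.wv.vector_size.

```python
import numpy as np

def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)  # average across words, per dimension

# Toy 2-dimensional vectors, made up for illustration.
toy_vectors = {
    "storm": np.array([1.0, 0.0]),
    "hurricane": np.array([0.0, 1.0]),
}

sentences = [["storm", "worse", "than", "hurricane"], ["green", "line"]]
# Stack the per-sentence vectors into one (n_sentences, dim) matrix.
matrix = np.vstack([sentence_vector(s, toy_vectors, dim=2) for s in sentences])
print(matrix)
```

With the dataframe from the question, the same helper could be applied as df["Text"].apply(lambda t: sentence_vector(gensim.utils.simple_preprocess(t, deacc=True), w2v_model.wv, w2v_model.wv.vector_size)), which also reuses the tokenizer the model was trained with.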