4 votes

I am trying to generate word vectors using PySpark. With gensim I can see the words and their closest words as below:

import os
from gensim.models import word2vec

# Tokenise each tweet (one per line) on whitespace
sentences = open(os.getcwd() + "/tweets.txt").read().splitlines()
w2v_input = []
for sentence in sentences:
    w2v_input.append(sentence.split())

model = word2vec.Word2Vec(w2v_input)
for key in model.wv.vocab.keys():
    print(key)
    print(model.most_similar(positive=[key]))

Using PySpark:

from pyspark.mllib.feature import Word2Vec

inp = sc.textFile("tweets.txt").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)

How can I extract the words from the vector space in model? That is, what is the PySpark equivalent of gensim's model.wv.vocab.keys()?

Background: I need to store the words and their synonyms from the model in a map so I can use them later for finding the sentiment of a tweet. I cannot reuse the word-vector model inside map functions in PySpark, because the model belongs to the Spark context (error pasted below). I want the PySpark Word2Vec version rather than gensim because it provides better synonyms for certain test words.
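
For illustration, here is a minimal sketch of the pattern that fails for me (the synonyms_for helper is just to show the shape of the closure); referencing the model inside it triggers the exception below:

# Fails: the lambda's closure captures `model`, which holds a reference
# to the SparkContext and cannot be shipped to the workers
def synonyms_for(word):
    return [s for s, _ in model.findSynonyms(word, 2)]

inp.map(lambda words: [synonyms_for(w) for w in words]).collect()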

 Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.

Any alternative solution is also welcome.


2 Answers

5 votes

The equivalent command in PySpark is model.getVectors(), which also returns a dictionary. Here is a quick toy example with only three words (alpha, beta, charlie), adapted from the documentation:

sc.version
# u'2.1.1'

from pyspark.mllib.feature import Word2Vec
sentence = "alpha beta " * 100 + "alpha charlie " * 10
localDoc = [sentence, sentence]
doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(doc)

model.getVectors().keys()
#  [u'alpha', u'beta', u'charlie']

Regarding finding synonyms, you may find another answer of mine useful.
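
In short, the MLlib model exposes findSynonyms(word, num) directly, which returns (word, cosine similarity) pairs; continuing the toy example above (the similarity values will vary between runs):

# Two nearest neighbours of 'alpha' by cosine similarity
for word, sim in model.findSynonyms('alpha', 2):
    print("%s: %f" % (word, sim))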

Regarding the error you mention and a possible workaround, have a look at this answer of mine.
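
The gist of that workaround, as a minimal sketch (assuming top-2 synonyms per word are enough): extract everything you need from the model on the driver into a plain Python dict, broadcast it, and have the worker-side closures read only the broadcast value, never the model itself:

# On the driver: precompute word -> top-2 synonyms as a plain dict
vocab = list(model.getVectors().keys())
syn_map = {w: [s for s, _ in model.findSynonyms(w, 2)] for w in vocab}

# Broadcast the dict; workers touch syn_bc.value, never the model
syn_bc = sc.broadcast(syn_map)
doc.map(lambda words: [(w, syn_bc.value.get(w, [])) for w in words]).first()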

0 votes

And as suggested here, if you want to include all the words in your documents, set the minCount parameter accordingly (the default is 5):

word2vec = Word2Vec()
word2vec.setMinCount(1)
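
For instance, a quick sketch in the same session: each word below occurs only three times in total, so the default threshold of 5 would remove all of them (Spark refuses to fit an empty vocabulary), while minCount=1 keeps them all:

# Each word occurs 3 times in total, below the default minCount of 5
rare = sc.parallelize(["alpha beta gamma"] * 3).map(lambda line: line.split(" "))
model = Word2Vec().setMinCount(1).fit(rare)  # the setters return self, so they chain
model.getVectors().keys()
# [u'alpha', u'beta', u'gamma']  (order may vary)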