I am trying to generate word vectors using PySpark. With gensim I can see each word in the vocabulary and its closest words as below:
import os
from gensim.models import word2vec

# one tweet per line, whitespace-tokenised
sentences = open(os.getcwd() + "/tweets.txt").read().splitlines()
w2v_input = []
for i in sentences:
    tokenised = i.split()
    w2v_input.append(tokenised)
model = word2vec.Word2Vec(w2v_input)
for key in model.wv.vocab.keys():
    print(key)
    print(model.most_similar(positive=[key]))
Using PySpark:

from pyspark.mllib.feature import Word2Vec

inp = sc.textFile("tweets.txt").map(lambda row: row.split(" "))  # tokenise each tweet
word2vec = Word2Vec()
model = word2vec.fit(inp)
How can I get the words out of the vector space in model, i.e. the PySpark equivalent of gensim's model.wv.vocab.keys()?
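To make the goal concrete, this is roughly the loop I would like to write against the PySpark model (list_vocabulary is a made-up placeholder for the call I am missing, and I am assuming findSynonyms is the right way to get the neighbours):

def list_vocabulary(m):
    # placeholder for the call I am asking about -- the PySpark
    # counterpart of gensim's model.wv.vocab.keys()
    raise NotImplementedError

for word in list_vocabulary(model):
    print(word)
    print(model.findSynonyms(word, 5))  # list of (word, similarity) pairs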
Background: I need to store the words and their synonyms from the model in a map so I can use them later for finding the sentiment of a tweet. I cannot reuse the word-vector model inside the map functions in PySpark because the model belongs to the Spark context (error pasted below). I want the PySpark word2vec version rather than gensim because it gives better synonyms for some of my test words.
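A stripped-down sketch of what I am attempting, using model and inp from above (tweet_sentiment is a hypothetical scoring helper and the 5 is arbitrary); it raises the exception pasted below because the closure drags model, and with it the SparkContext, onto the workers:

def tweet_sentiment(tokens):
    # hypothetical scoring that calls the model per token
    return sum(len(model.findSynonyms(t, 5)) for t in tokens)

scores = inp.map(tweet_sentiment)
scores.collect()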
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.
Any alternative solution is also welcome.
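For context, the shape of the workaround I have in mind, once I can enumerate the vocabulary, is to build a plain Python dict of word -> synonyms on the driver and broadcast it so the workers never touch the model (sketch only; vocab is the missing piece and the scoring helper is again hypothetical):

vocab = []  # <-- to be filled with the model's vocabulary (the question above)
word_synonyms = {w: [s for s, sim in model.findSynonyms(w, 5)] for w in vocab}
word_synonyms_bc = sc.broadcast(word_synonyms)

def tweet_sentiment(tokens):
    # hypothetical scoring: counts tokens that appear in the synonym map
    syn_map = word_synonyms_bc.value
    return sum(1 for t in tokens if t in syn_map)

scores = inp.map(tweet_sentiment).collect()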