I am currently working on a Sparkling Water application, and I am a total beginner in Spark and H2O.
What I want to do:
- load an input text file
- create a word2vec model
- create a DataFrame with a column word and a column Vector
- use the DataFrame as input for H2O
Creating the model gives me a map, but I don't know how to create a DataFrame from it. The output should look like this:
word | Vector
assert | [0.3, 0.4.....]
sense | [0.6, 0.2.....] and so on.
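To make the target shape concrete, here is a small Spark-free sketch of the reshaping step. It assumes the map can be treated like a plain Python dict from word to a list of floats; `word_vectors` and `map_to_rows` are hypothetical names, and the numbers are made up for illustration.

```python
# Hypothetical stand-in for the map returned by model.getVectors();
# the words and values are invented for illustration only.
word_vectors = {
    "sense": [0.6, 0.2],
    "assert": [0.3, 0.4],
}


def map_to_rows(vectors):
    """Turn a word -> vector mapping into a list of (word, vector) rows, sorted by word."""
    return [(word, [float(x) for x in vec]) for word, vec in sorted(vectors.items())]


rows = map_to_rows(word_vectors)

# Print the rows in the "word | Vector" layout described above.
for word, vec in rows:
    print(word, "|", vec)
```

Rows of this shape could presumably be handed to `sqlContext.createDataFrame(rows, ["word", "vector"])`. Whether the object returned by `getVectors()` can be iterated like a dict is an assumption here and may depend on the Spark version.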
This is my code so far:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row
# Starting h2o application on spark cluster
hc = H2OContext(sc).start()
# Loading input file
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))
# building the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)
# Sanity check
model.findSynonyms("property", 5)
# Assign the vector representation (a map) to a variable
wordVectorsDF = model.getVectors()
# Transform the wordVectorsDF map into a DataFrame
Is there an approach for this, or are there functions provided by Spark?
Thanks in advance