2
votes

I am currently working on a sparkling water application and I am a total beginner in spark and h2o.

What I want to do:

  1. loading a input textfile
  2. create a word2vec model
  3. create a dataframe with a column word and a column Vector
  4. using the dataframe as input for h2o

By creating the model i get a map, but i don't know how to create a dataframe of it. The output should look like that:

word | Vector

assert | [0.3, 0.4.....]

sense | [0.6, 0.2.....] and so on.

This is my code so far:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o

from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row


# Starting h2o application on spark cluster
hc = H2OContext(sc).start()

# Loading input file
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))

# building the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)

# Sanity check
model.findSynonyms("property",5)

# assign vector representation (map to variable
wordVectorsDF = model.getVectors()

# Transform wordVectorsDF word into dataframe

Is there any approach to that or functions provided by spark?

Thanks in advance

2

2 Answers

3
votes

I found out that there are two libraries for a Word2Vec transformation - I don't know why.

from pyspark.mllib.feature import Word2Vec
from pyspark.ml.feature import Word2Vec

The second line returns a data frame with the function getVectors()and has diffenrent parameters for building a model from the first line.

Maybe somebody can comment on that concerning the 2 different libraries.

Thanks in advance.

-1
votes

First of all in H2O we don't support a Vector column type, you'd have to make a frame like this:

word   | V1  | V2  | ...
assert | 0.3 | 0.4 | ...
sense  | 0.6 | 0.2 | ...

Now for the actual question - no, since it's a Scala Map, we provide ways to create frames from data sources (files on HDFS/S3, databases etc) or conversions from RDDs/DataFrames but not from Java/Scala collections. Writing one would be possible but quite cumbersome.

Not the most performant solution but the easiest code-wise would be to make a DF (or RDD) first (by running sc.parallelize on map.toSeq) and then convert it to an H2OFrame:

import hc._
val wordsDF = sc.parallelize(wordVectorsDF.toSeq).toDF
val h2oFrame = asH2OFrame(wordsDF)