4 votes

I am using Spark MLlib to generate word vectors. I want to fit all of my data, then get the trained word vectors and dump them to a file.

I am doing this:

JavaRDD<List<String>> data = javaSparkContext.parallelize(streamingData, partitions);
Word2Vec word2vec = new Word2Vec();
Word2VecModel model = word2vec.fit(data);

So, if my training data had sentences like

I love Spark

I want to save the output to files as:

I       0.03 0.53 0.12...
love    0.31 0.14 0.12...
Spark   0.41 0.18 0.84...

After training, I am getting the vectors from the model object like this:

Map<String, float[]> wordMap = JavaConverters.mapAsJavaMapConverter(model.getVectors()).asJava();
List<String> wordvectorlist = Lists.newArrayList();
for (Map.Entry<String, float[]> entry : wordMap.entrySet()) {
    StringBuilder wordvector = new StringBuilder(entry.getKey());
    for (float f : entry.getValue()) {
        wordvector.append(' ').append(f);
    }
    wordvectorlist.add(wordvector.toString());
    // flush to disk in batches so the list itself stays bounded
    if (wordvectorlist.size() > 1000000) {
        writeToFile(wordvectorlist);
        wordvectorlist.clear();
    }
}
// write out the final partial batch left over after the loop
if (!wordvectorlist.isEmpty()) {
    writeToFile(wordvectorlist);
}

I will be generating these word vectors for a very large dataset (~1.5 TB), so I may not be able to hold the returned Word2VecModel object in my driver's memory. How can I store this word-vector map as an RDD so that I can write it to files without keeping the full map in driver memory?

I looked into the word2vec implementation in deeplearning4j, but that implementation also requires loading all the vectors into driver memory.

Edit: At the advice of the admins I have made this a comment; sorry for the spam. We are working on a parameter-server-based implementation for our next release. All I can say is keep an eye on the deeplearning4j implementation. This new parameter-server-based implementation will work with DeepWalk, GloVe, and paragraph vectors as well. If you are curious about this parameter server, we are basing it on nd4j here: github.com/deeplearning4j/nd4j/tree/master/… We welcome feedback if you are interested in telling us more about your use case. – Adam Gibson
@AdamGibson thanks for your input. I am using the word2vec library and have stumbled across a similar problem. I have a huge file to load (around 6 GB), which makes reading it into memory very difficult. Based on your comment above, it looks like we now have a parameter-server-based implementation. Could you point me to the documentation/examples for it? – Darshan Mehta

1 Answer

3 votes

Word2VecModel has a save function which saves the model to disk in its own format. This creates a directory called data containing Parquet files with the vectors, plus a metadata file with human-readable metadata.
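A minimal sketch of the call (the output path here is a placeholder):

// writes <path>/data (Parquet vectors) and <path>/metadata (human-readable)
model.save(javaSparkContext.sc(), "/tmp/word2vec-model");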

You can then read the Parquet files and convert them yourself, or use spark.read.parquet to load them into a DataFrame. Each row holds part of the map, and you can write it out in whatever format you wish.
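As a sketch of that approach, assuming Spark 2.x and that the saved data keeps the word/vector column layout MLlib uses in this save format (paths are placeholders), you can rewrite the vectors as plain text without ever collecting them on the driver:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();

// each row of the saved data holds a word and its vector (array of floats);
// "word" and "vector" are the column names assumed from the save format
Dataset<Row> vectors = spark.read().parquet("/tmp/word2vec-model/data");

// build "word v1 v2 ..." lines on the executors and write them out
// distributed, so the full map never sits in driver memory
JavaRDD<String> lines = vectors.toJavaRDD().map(row -> {
    StringBuilder sb = new StringBuilder(row.<String>getAs("word"));
    for (Object v : row.getList(row.fieldIndex("vector"))) {
        sb.append(' ').append(v);
    }
    return sb.toString();
});
lines.saveAsTextFile("/tmp/word-vectors-text");

saveAsTextFile writes one part file per partition, which also sidesteps the manual batching in the question's loop.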