I am using Spark MLlib to generate word vectors. I want to fit all my data, then extract the trained word vectors and dump them to a file.
I am doing this:
JavaRDD<List<String>> data = javaSparkContext.parallelize(streamingData, partitions);
Word2Vec word2vec = new Word2Vec();
Word2VecModel model = word2vec.fit(data);
So, if my training data had sentences like
I love Spark
I want to save the output in files as:
I 0.03 0.53 0.12...
love 0.31 0.14 0.12...
Spark 0.41 0.18 0.84...
After training, I am getting the vectors from the model object like this:
Map<String, float[]> wordMap = JavaConverters.mapAsJavaMapConverter(model.getVectors()).asJava();
List<String> wordvectorlist = Lists.newArrayList();
for (Map.Entry<String, float[]> entry : wordMap.entrySet()) {
    // Format each entry as "word f1 f2 f3 ..."
    StringBuilder wordvector = new StringBuilder(entry.getKey());
    for (float f : entry.getValue()) {
        wordvector.append(' ').append(f);
    }
    wordvectorlist.add(wordvector.toString());
    // Flush in batches so the formatted strings do not all accumulate in memory
    if (wordvectorlist.size() > 1000000) {
        writeToFile(wordvectorlist);
        wordvectorlist.clear();
    }
}
// Flush whatever is left over after the last full batch
if (!wordvectorlist.isEmpty()) {
    writeToFile(wordvectorlist);
}
I will be generating these word vectors for very large data (~1.5 TB), so I might not be able to hold the returned Word2VecModel object in my driver's memory. How can I store this word-vector map as an RDD, so that I can write it out to files without keeping the full map in driver memory?
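For reference, the closest I have managed so far is the sketch below (outputPath is a placeholder; partitions is the same count as above): copy the collected vectors into a serializable list, parallelize it, and let saveAsTextFile write the files from the executors. This avoids buffering the formatted strings myself, but the whole map still has to fit on the driver once, so it does not really solve the problem:

List<Tuple2<String, float[]>> pairs = Lists.newArrayList();
for (Map.Entry<String, float[]> entry : wordMap.entrySet()) {
    // Tuple2 is serializable, unlike the HashMap entries themselves
    pairs.add(new Tuple2<>(entry.getKey(), entry.getValue()));
}
javaSparkContext.parallelize(pairs, partitions)
    .map(pair -> {
        StringBuilder sb = new StringBuilder(pair._1());
        for (float f : pair._2()) {
            sb.append(' ').append(f);
        }
        return sb.toString();
    })
    .saveAsTextFile(outputPath); // writes one part file per partition

I am also aware that Word2VecModel has a save(sc, path) method that writes the model out through Spark as Parquet, but as far as I can tell fit() still has to assemble the whole model on the driver first, so that does not help with the memory limit either.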
I also looked into the word2vec implementation in deeplearning4j, but that implementation likewise requires loading all the vectors into driver memory.
I am using the word2vec library and have stumbled across a similar problem. I have a huge file to load (around 6 GB), which makes reading it into memory very difficult. Based on your comment above, it looks like we now have a server-based implementation. Could you point me to the documentation/examples for the same? – Darshan Mehta