2 votes

I want to train a word2vec model on a ~10 GB news corpus on my Spark cluster. The following is the configuration of my Spark cluster:

  1. one master and 4 workers
  2. each with 80 GB of memory and 24 cores

However, I find that training word2vec with Spark MLlib doesn't take full advantage of the cluster's resources. For example, a screenshot of the top command on one worker (image not reproduced here) shows only 100% CPU in use on that worker, while the other three workers are idle (so I haven't pasted their screenshots). I just trained a word2vec model on a ~2 GB news corpus and it took about 6 hours. So: how can I train the model more efficiently? Thanks everyone in advance :)


UPDATE1: the following is what I ran in the spark-shell:

  1. How I start the spark-shell:

```
spark-shell \
  --master spark://ip:7077 \
  --executor-memory 70G \
  --driver-memory 70G \
  --conf spark.akka.frameSize=2000 \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.default.parallelism=180
```
  2. What I run to train the word2vec model in the spark-shell:

```scala
// import related packages
import org.apache.spark._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

// read the ~10 GB news corpus
val newsdata = sc.textFile("hdfs://ip:9000/user/bd/newsdata/*", 600)
  .map(line => line.split(" ").toSeq)

// configure word2vec parameters
val word2vec = new Word2Vec()
word2vec.setMinCount(10)
word2vec.setNumIterations(10)
word2vec.setVectorSize(200)

// train the model
val model = word2vec.fit(newsdata)
```

UPDATE2:

I have been training the model for about 24 hours and it still hasn't completed. The cluster is running as before: only one worker is at 100% CPU, and the other three workers are idle.

Post the code and command you are using to train your Word2Vec model. – kampta

Thank you very much for the reply. I have updated my post with the code I used to train the Word2Vec model. – Lei Li

I have the same issue. – blackbox

2 Answers

5 votes

I experienced a similar problem in Python when training a Word2Vec model. Looking at the PySpark docs for Word2Vec here, they read:

setNumIterations(numIterations) Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.

New in version 1.2.0.

setNumPartitions(numPartitions) Sets number of partitions (default: 1). Use a small number for accuracy.

New in version 1.2.0.

My word2vec model stopped hanging, and Spark stopped running out of memory, once I increased the number of partitions used by the model so that numIterations <= numPartitions.

I suggest you set word2vec.setNumIterations(1) or word2vec.setNumPartitions(10).
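For concreteness, here is a minimal Scala sketch (matching the question's code) that applies the second suggestion. The partition count of 10 is just the example value from above, chosen so that numIterations <= numPartitions holds:

```scala
import org.apache.spark.mllib.feature.Word2Vec

// Same parameters as in the question, plus an explicit partition count.
// Keeping numIterations <= numPartitions lets each iteration be
// distributed across partitions instead of bottlenecking on one worker.
val word2vec = new Word2Vec()
word2vec.setMinCount(10)
word2vec.setVectorSize(200)
word2vec.setNumPartitions(10) // spread training across 10 partitions
word2vec.setNumIterations(10) // now satisfies numIterations <= numPartitions

// newsdata is the RDD[Seq[String]] built in the question
val model = word2vec.fit(newsdata)
```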

2 votes

As your model is taking too long to train, I think you should first try to understand how Spark actually benefits the model-training part. As per this paper:

Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty

Spark MLlib removes this performance penalty by caching the data in memory during the first iteration, so subsequent iterations are extremely quick compared to the first one; hence the significant reduction in model training time.

I think that, in your case, the executor memory might be insufficient to hold a partition of the data in memory. Contents would then be spilled to disk and fetched from disk again in every iteration, killing any performance benefit of Spark. To confirm this is actually the case, look at the executor logs, which would contain lines like "Unable to store rdd_x_y in memory".

If this is indeed the case, you'll need to adjust --num-executors, --executor-memory, and numPartitions to find values that load the entire dataset into memory. You can experiment with a small dataset, a single executor, and a small executor-memory value on your local machine, analyzing the logs while incrementally increasing executor memory, to see at which config the data is fully cached in memory. Once you have the configs for the small dataset, you can do the math to figure out how many executors with how much memory are required, and what the number of partitions should be for the required partition size.
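To make that check concrete, here is a minimal Scala sketch, under the assumption that you cache the tokenized corpus explicitly before training: it persists the RDD and then reads SparkContext.getRDDStorageInfo to see how many partitions actually fit in memory (the same numbers appear in the Spark UI's Storage tab). The HDFS path and partition count are taken from the question:

```scala
import org.apache.spark.storage.StorageLevel

// Cache the tokenized corpus explicitly so the repeated passes that
// iterative training makes can be served from memory.
val newsdata = sc
  .textFile("hdfs://ip:9000/user/bd/newsdata/*", 600)
  .map(line => line.split(" ").toSeq)
  .persist(StorageLevel.MEMORY_ONLY)

newsdata.count() // force an action so the cache is materialized

// Inspect how much of the RDD was actually cached. Partitions that did
// not fit in memory will be recomputed (or read from disk, depending on
// the storage level) on every iteration, which is the penalty above.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
    s"partitions cached, ${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}
```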

I faced a similar problem and managed to bring the model-training time down from around 4 hours to 20 minutes by following the above steps.