I want to train a word2vec model on a ~10 GB news corpus on my Spark cluster. The following is the configuration of my Spark cluster:
- one master and 4 workers
- each worker with 80 GB of memory and 24 cores
However, I find that training Word2Vec with Spark MLlib doesn't take full advantage of the cluster's resources. For example, here is a `top` screenshot from Ubuntu on one worker:

[screenshot of `top` showing one core at 100% on a single worker]

As the screenshot shows, only one core (100% CPU) is busy on a single worker, and the other three workers are idle (so I have not pasted their screenshots). I just trained a word2vec model on a ~2 GB news corpus and it took about 6 hours, so I want to know how to train the model more efficiently. Thanks everyone in advance :)
UPDATE 1: the following is what I used in the spark-shell.
- how I start the spark-shell:
```
spark-shell \
  --master spark://ip:7077 \
  --executor-memory 70G \
  --driver-memory 70G \
  --conf spark.akka.frameSize=2000 \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.default.parallelism=180
```
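For reference, one can verify inside the spark-shell that these flags actually took effect; a minimal sanity-check sketch:

```scala
// Sanity check inside spark-shell: confirm the launch flags were picked up.
sc.getConf.get("spark.default.parallelism")  // should print "180"
sc.defaultParallelism                        // parallelism used when no count is given
sc.getConf.get("spark.executor.memory")      // should reflect --executor-memory
```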
- the following code is what I used to train the word2vec model in the spark-shell:
```scala
// import related packages
import org.apache.spark._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

// read the ~10 GB news corpus
val newsdata = sc.textFile("hdfs://ip:9000/user/bd/newsdata/*", 600).map(line => line.split(" ").toSeq)

// configure Word2Vec parameters
val word2vec = new Word2Vec()
word2vec.setMinCount(10)
word2vec.setNumIterations(10)
word2vec.setVectorSize(200)

// train the model
val model = word2vec.fit(newsdata)
```
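One knob not set above is `setNumPartitions`: MLlib's Word2Vec trains on a single partition by default (numPartitions = 1), which would match a single busy core. A minimal sketch of the same setup with explicit training parallelism; the value 16 is just an assumed starting point to tune:

```scala
// Sketch: same configuration as above, plus explicit training parallelism.
// setNumPartitions > 1 distributes the skip-gram updates across the cluster;
// the MLlib docs suggest keeping it small, since accuracy degrades as it grows,
// and recommend numIterations <= numPartitions.
val word2vecParallel = new Word2Vec()
word2vecParallel.setMinCount(10)
word2vecParallel.setNumIterations(10)
word2vecParallel.setVectorSize(200)
word2vecParallel.setNumPartitions(16)  // assumed value; tune for accuracy vs. speed
val modelParallel = word2vecParallel.fit(newsdata)
```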
UPDATE 2:
I have now been training the model for about 24 hours and it still hasn't completed. The cluster is running as before: only one core (100% CPU) is busy on a single worker, and the other three workers are idle.
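To rule out the input partitioning as the cause, here is a small diagnostic sketch, assuming the `newsdata` RDD defined above:

```scala
// Diagnostic sketch: confirm the corpus itself is spread across the cluster,
// so the bottleneck is the training step rather than the input RDD.
newsdata.cache()
println(s"partitions = ${newsdata.partitions.length}")  // expect 600 from textFile(..., 600)
println(s"sentences  = ${newsdata.count()}")            // forces a full distributed pass
```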