2 votes

I want to train a word2vec model on a ~10 GB news corpus on my Spark cluster. The following is the configuration of my Spark cluster:

  1. one master and 4 workers
  2. each with 80 GB of memory and 24 cores

However, I find that training word2vec with Spark MLlib doesn't take full advantage of the cluster's resources. For example, a screenshot of the top command on one worker (image not reproduced here) shows only 100% CPU in use on that worker, while the other three workers are idle (so I haven't pasted their screenshots). I just trained a word2vec model on a ~2 GB news corpus and it took about 6 hours. So: how can I train the model more efficiently? Thanks everyone in advance :)


UPDATE1: the following is what I ran in the spark-shell:

  1. How I start the spark-shell:

```
spark-shell \
  --master spark://ip:7077 \
  --executor-memory 70G \
  --driver-memory 70G \
  --conf spark.akka.frameSize=2000 \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.default.parallelism=180
```
  2. What I run to train the word2vec model in the spark-shell:

```scala
// import related packages
import org.apache.spark._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

// read the ~10 GB news corpus
val newsdata = sc.textFile("hdfs://ip:9000/user/bd/newsdata/*", 600)
  .map(line => line.split(" ").toSeq)

// configure word2vec parameters
val word2vec = new Word2Vec()
word2vec.setMinCount(10)
word2vec.setNumIterations(10)
word2vec.setVectorSize(200)

// train the model
val model = word2vec.fit(newsdata)
```

UPDATE2:

I have been training the model for about 24 hours and it still hasn't completed. The cluster is running as before: only one worker is at 100% CPU, and the other three workers are idle.

Post the code and command you are using to train your Word2Vec model. – kampta

Thank you very much for the reply. I have updated my post with the code I used to train the Word2Vec model. – Lei Li

I have the same issue. – blackbox

2 Answers

5 votes

I experienced a similar problem in Python when training a Word2Vec model. Looking at the PySpark docs for Word2Vec here, they read:

setNumIterations(numIterations) Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.

New in version 1.2.0.

setNumPartitions(numPartitions) Sets number of partitions (default: 1). Use a small number for accuracy.

New in version 1.2.0.

My word2vec model stopped hanging, and Spark stopped running out of memory, once I increased the number of partitions used by the model so that numIterations <= numPartitions.

I suggest you set word2vec.setNumIterations(1) or word2vec.setNumPartitions(10).
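For concreteness, here is a minimal Scala sketch (matching the question's code) that applies the second suggestion. The partition count of 10 is just the example value from above, chosen so that numIterations <= numPartitions holds:

```scala
import org.apache.spark.mllib.feature.Word2Vec

// Same parameters as in the question, plus an explicit partition count.
// Keeping numIterations <= numPartitions lets each iteration be
// distributed across partitions instead of bottlenecking on one worker.
val word2vec = new Word2Vec()
word2vec.setMinCount(10)
word2vec.setVectorSize(200)
word2vec.setNumPartitions(10) // spread training across 10 partitions
word2vec.setNumIterations(10) // now satisfies numIterations <= numPartitions

// newsdata is the RDD[Seq[String]] built in the question
val model = word2vec.fit(newsdata)
```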

2 votes

As your model is taking too long to train, I think you should first try to understand how Spark actually benefits the model-training part. As per this paper:

Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty

Spark MLlib removes this performance penalty by caching the data in memory during the first iteration, so subsequent iterations are extremely quick compared to the first one; hence the significant reduction in model training time.

I think that, in your case, the executor memory might be insufficient to hold a partition of the data in memory. Contents would then be spilled to disk and fetched from disk again in every iteration, killing any performance benefit of Spark. To confirm this is actually the case, look at the executor logs, which would contain lines like "Unable to store rdd_x_y in memory".

If this is indeed the case, you'll need to adjust --num-executors, --executor-memory, and numPartitions to find values that load the entire dataset into memory. You can experiment with a small dataset, a single executor, and a small executor-memory value on your local machine, analyzing the logs while incrementally increasing executor memory, to see at which config the data is fully cached in memory. Once you have the configs for the small dataset, you can do the math to figure out how many executors with how much memory are required, and what the number of partitions should be for the required partition size.
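To make that check concrete, here is a minimal Scala sketch, under the assumption that you cache the tokenized corpus explicitly before training: it persists the RDD and then reads SparkContext.getRDDStorageInfo to see how many partitions actually fit in memory (the same numbers appear in the Spark UI's Storage tab). The HDFS path and partition count are taken from the question:

```scala
import org.apache.spark.storage.StorageLevel

// Cache the tokenized corpus explicitly so the repeated passes that
// iterative training makes can be served from memory.
val newsdata = sc
  .textFile("hdfs://ip:9000/user/bd/newsdata/*", 600)
  .map(line => line.split(" ").toSeq)
  .persist(StorageLevel.MEMORY_ONLY)

newsdata.count() // force an action so the cache is materialized

// Inspect how much of the RDD was actually cached. Partitions that did
// not fit in memory will be recomputed (or read from disk, depending on
// the storage level) on every iteration, which is the penalty above.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
    s"partitions cached, ${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}
```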

I faced a similar problem and managed to bring the model-training time down from around 4 hours to 20 minutes by following the above steps.