
I am running Spark on a single machine with 24 cores and 48G RAM.

I am able to train an implicit model with 6M users, 1.2M items, and 216M actions (views/buys).

Now I am trying to train recommendations on 7M users, 1.5M items, and 440M user actions on items.

I am using 20 executors, driver memory 15G, executor memory 4G.

I am training with rank 8 and 15 iterations.

I am getting a Java heap space out-of-memory error while training the model with ALS.trainImplicit:

model = ALS.trainImplicit(training_RDD, rank, seed=seed, iterations=iterations, lambda_=regularization_parameter, alpha=config.alpha)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/recommendation.py", line 314, in trainImplicit
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o86.trainImplicitALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 44, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
 at scala.collection.mutable.ArrayBuilder$ofInt.mkArray(ArrayBuilder.scala:323)

I am unable to work out how to correct this error. From `Lost task 0.0 in stage 4.0 (TID 44, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space`, I can tell that an executor is running out of memory.

I have tried increasing executor memory and decreasing driver memory, but it didn't help; I am still getting the same error.

Full stack trace: https://www.dropbox.com/s/g2vlmtjo8bb4gd1/javaheapspaceerror.txt?dl=0


1 Answer


You need to set a checkpoint directory. Iterative algorithms like ALS rely on checkpointing to truncate the RDD lineage: without a checkpoint directory, the lineage grows with every iteration, and the accumulated intermediate state can exhaust the heap and cause exactly this kind of OutOfMemoryError.

sc.setCheckpointDir('/tmp')
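
To make the placement concrete, here is a sketch of how this fits into the training code from the question. The variable names (`training_RDD`, `rank`, `seed`, `iterations`, `regularization_parameter`, `config.alpha`) are taken from the question and are assumed to be defined elsewhere; `/tmp` works in local mode, but on a real cluster the checkpoint directory should be a path visible to all executors (e.g. on HDFS).

```
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext.getOrCreate()

# Set the checkpoint directory BEFORE training; MLlib's ALS only
# checkpoints intermediate RDDs when a directory has been set
# (by default, I believe, every 10 iterations).
sc.setCheckpointDir('/tmp')

model = ALS.trainImplicit(
    training_RDD,
    rank,
    seed=seed,
    iterations=iterations,
    lambda_=regularization_parameter,
    alpha=config.alpha,
)
```

With checkpointing enabled, the lineage is periodically cut and earlier shuffle data can be cleaned up, which keeps per-iteration memory roughly constant instead of growing across all 15 iterations.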