0 votes

I am getting this issue with Spark 2.3.

I am running the job on a Cloudera cluster with 7 nodes (64 GB RAM and 16 cores each).

Relevant configuration: --conf spark.executor.memoryOverhead=5G --executor-memory 30G --num-executors 15 --executor-cores 5

The error is raised by the Spark executors:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:110)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.StaticInvoke7$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:288)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:380)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Here's the code that I am running:

val table_df = spark.createDataFrame(table, schema)
val table_df_filled = table_df.na.fill("null")
table_df_filled.write.mode("overwrite").csv("path")

I have tried to increase executor, driver, and overhead memory.

I have also tried to increase the number of partitions to several times the default (4000, 8000) via the spark.default.parallelism conf.
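
For reference, the partition count can also be raised explicitly on the DataFrame right before the write; this is a minimal sketch, with 4000 used only as an illustrative value taken from the numbers above:

// Repartition the DataFrame explicitly before writing it out
val table_df_repartitioned = table_df_filled.repartition(4000)
table_df_repartitioned.write.mode("overwrite").csv("path")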

As for data size, each row (record) has several metadata columns and one big string column. I am sure that the problem lies in the big string column, where I store the complete HTML source of a single webpage (no more than 1 GB each, I think?). The total data size is around 100 GB.

Has anyone experienced similar issues?

Some follow-up:

  • I tried printing over the whole RDD, and it ran through.
  • Counting on the DataFrame failed with the same error, so I guess the problem is related to a column size limit in the DataFrame?
  • I managed to output the content directly from the RDD with saveAsTextFile with no issue (rough sketches of these checks follow below).
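
The checks above correspond roughly to the following sketch (not the exact code; table is the RDD passed to createDataFrame earlier, and the output path is a placeholder):

// Iterating over the whole RDD ran through (output goes to executor logs)
table.foreach(println)

// Counting on the DataFrame failed with the same OutOfMemoryError
table_df_filled.count()

// Writing the RDD directly as text worked
table.saveAsTextFile("rdd_output_path")   // placeholder path
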
Can you give more details on the size of the data you are trying to load, the data source, and the memory configuration? - sai pradeep kumar kotha
Try to increase the number of partitions by calling table_df_filled.repartition(num_partitions) before writing. - Denis Makarenko
Hi, information added as mentioned. - Xinyue Wang

1 Answer

0 votes

It turns out that the cause of the problem is that some records hit the JVM array size limit during the conversion from the RDD to the DataFrame: encoding the big string column to UTF-8 bytes (the StringCoding.encode frame in the stack trace above) can request a byte array larger than the JVM allows, roughly Integer.MAX_VALUE elements (about 2 GB). In this case, I have the following two options:

  • Split the problematic string column into multiple columns to reduce its size.
  • Format the output myself and write the data to HDFS directly from the RDD via saveAsTextFile (a sketch follows below).
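
A minimal sketch of the second option, assuming the source RDD holds Row objects with the schema described above; the delimiter and output path are placeholders, and nulls are written as the literal string "null" to mirror the na.fill above:

// Build each output line by hand and write it straight from the RDD,
// skipping the RDD-to-DataFrame conversion that triggered the error.
val lines = table.map { row =>
  row.toSeq
    .map(v => if (v == null) "null" else v.toString)
    .mkString(",")   // placeholder delimiter; real CSV output needs quoting/escaping
}
lines.saveAsTextFile("hdfs:///output/path")   // placeholder path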