Spark OOM issue when writing dataframe to HDFS

Question

Getting this issue with Spark 2.3.

I am running the task on a Cloudera cluster with 7 nodes: 64 GB ram, 16 cores each

Related conf: --conf spark.executor.memoryOverhead=5G --executor-memory 30G --num-executors 15 --executor-cores 5

error is raised by spark executors:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:110)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.StaticInvoke7$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:288)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:380)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Here's the code that I am running:

val table_df = spark.createDataFrame(table,schema)
val table_df_filled = table_df.na.fill("null")
table_df_filled.write.mode("overwrite").csv("path")

I have tried to increase executor/driver/overhead memory;

I have tried to increase the partition to several times larger(4000,8000) through spark.default.parallelism conf;

As to data size, for each row (record), there are several metadata columns and one big string column. I am sure that the problem lies on the big string column where I save a complete HTML code of a single webpage in this field (no more than 1 GB for each one I think?). The total data size is around 100GBs.

Has any one experienced similar issues?

Some follow up:

I have tried to print over the whole RDD, it ran through.
By counting on the Dataframe, the task failed with same issue. So I guess the problem is related to the Dataframe column size limit?
I managed to output the content directly from RDD with saveAsTextFile with no issue.

can you give more details of size of data you are trying to load and source of data and memory configuration — sai pradeep kumar kotha
Try to increase the number of partitions by calling table_df_filled.repartition(num_partitions) before writing. — Denis Makarenko

Xinyue Wang Xinyue Wang · Accepted Answer · 2018-07-19T19:57:14

It turns out that the cause of the problem is that some records hits the Array size limit during the DataFrame conversion from RDD. In this case, I have following two options:

Split the problem string colomn to multi colomns (reduce size).
Code the output format by myself and write data to HDFS through saveAsTextFile function directly from RDD.

Spark OOM issue when writing dataframe to HDFS

1 Answers