2
votes

I have a Spark Streaming application that reads from a Kafka topic and writes the data to HDFS in Parquet format. Over a fairly short time, the physical memory of the container keeps growing until it reaches the limit and the job fails with:

"Diagnostics: Container [pid=29328,containerID=container_e42_1512395822750_0026_02_000001] is running beyond physical memory limits. Current usage: 1.5 GB of 1.5 GB physical memory used; 2.3 GB of 3.1 GB virtual memory used. Killing container."

The container being killed is the one that runs the driver, so the whole application is killed with it. When searching for this error I only found suggestions to increase the memory, but I think that would only postpone the problem. I want to understand why the memory keeps increasing when I don't cache anything in memory myself. I also see that the memory of all the containers grows, but the others are killed after a while, before reaching the maximum. One post I read noted: "Your job is writing out Parquet data, and Parquet buffers data in memory prior to writing it out to disk".

The code we are using (we also tried without the repartition; we're not sure it is needed):

// Executed per micro-batch; rdd contains the JSON records received from Kafka
val repartition = rdd.repartition(6)
val df: DataFrame = sqlContext.read.json(repartition)   // parse the JSON strings into a DataFrame
df.write.mode(SaveMode.Append).parquet(dbLocation)      // append the batch as Parquet files on HDFS
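For context, this snippet runs inside the DStream processing loop; below is a minimal sketch of how the surrounding job could look (the broker address, topic name, batch interval, consumer group, and dbLocation path are all assumptions, not taken from the original code):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-parquet")
    val ssc  = new StreamingContext(conf, Seconds(30))              // batch interval: assumption
    val sqlContext = SQLContext.getOrCreate(ssc.sparkContext)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",                        // assumption
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "parquet-writer",                     // assumption
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    val dbLocation = "hdfs:///data/events"                          // assumption

    // Same per-batch logic as the snippet above
    stream.map(_.value).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val df = sqlContext.read.json(rdd.repartition(6))
        df.write.mode(SaveMode.Append).parquet(dbLocation)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}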

Is there some way to actually fix the increasing memory problem?

Screenshot: the created Parquet files.

Screenshots: the NodeManager logs showing the increase in memory.


1 Answer

0
votes

Assuming your application does nothing other than write the data out, I suspect the root cause is the size of the data received in each batch: it's possible that the data received in one of the batches exceeds the configured thresholds. Assuming the application is killed for this reason, the solution is to enable back pressure. The approach is detailed in the post below.

Limit Kafka batches size when using Spark Streaming
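For reference, back pressure is enabled through Spark configuration; here is a minimal sketch (the property names are standard Spark Streaming settings, the rate values and app name are illustrative assumptions only):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-to-parquet")
  // Let Spark adapt the Kafka ingest rate to the observed processing rate
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard cap per Kafka partition per second for the direct stream (tune to your load)
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Optional: rate used for the first batches, before back pressure has feedback to work with
  .set("spark.streaming.backpressure.initialRate", "1000")

val ssc = new StreamingContext(conf, Seconds(30))

With these settings the direct Kafka stream is throttled, so a single oversized batch is much less likely to blow past the container's memory limit.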