I have a Spark Streaming application that reads from a Kafka topic and writes the data to HDFS in Parquet format. I see that over a short period of time the physical memory of the container keeps growing until it reaches the maximum size and the job fails with:

"Diagnostics: Container [pid=29328,containerID=container_e42_1512395822750_0026_02_000001] is running beyond physical memory limits. Current usage: 1.5 GB of 1.5 GB physical memory used; 2.3 GB of 3.1 GB virtual memory used. Killing container."

The container being killed is the one that runs the driver, so the whole application is killed with it. When searching for this error I only found solutions that increase the memory, but I think that would only postpone the problem. I want to understand why the memory keeps increasing when I don't cache anything in memory. I also see that all the other containers grow in memory as well, but they are simply killed after a while, before reaching the maximum. I saw in some post: "Your job is writing out Parquet data, and Parquet buffers data in memory prior to writing it out to disk".
The code we are using (we also tried without the repartition; not sure it is needed):
import org.apache.spark.sql.{DataFrame, SaveMode}

// runs once per micro-batch: parse the JSON records into a DataFrame and append them as Parquet
val repartition = rdd.repartition(6)
val df: DataFrame = sqlContext.read.json(repartition)
df.write.mode(SaveMode.Append).parquet(dbLocation)
Is there some way to just fix the increasing memory problem?
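For context, here is roughly how that snippet is wired into the streaming job. This is a minimal sketch, not our exact code: the object name, broker list, topic name, batch interval, and output path are placeholders, and I am showing the Kafka 0.8 direct-stream API for illustration; only the three lines inside foreachRDD are taken from the job itself.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-parquet")
    val ssc = new StreamingContext(conf, Seconds(30))        // placeholder batch interval
    val sqlContext = SQLContext.getOrCreate(ssc.sparkContext)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder brokers
    val topics = Set("events")                                       // placeholder topic
    val dbLocation = "hdfs:///data/events"                           // placeholder output path

    // direct stream over the Kafka topic; each record's value is a JSON string
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // the three lines from the question run here, once per micro-batch
        val repartition = rdd.repartition(6)
        val df: DataFrame = sqlContext.read.json(repartition)
        df.write.mode(SaveMode.Append).parquet(dbLocation)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}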
[Screenshot: the created Parquet files]

[Screenshot: the NodeManager logs showing the memory increase]