
We are running a Structured Streaming job with Spark 2.4.3 that reads data from Kafka, transforms it (flattens it and creates some columns using a UDF), and then writes it back to a different Kafka topic. The stream runs with a processing-time trigger of two minutes. After 10-12 hours we noticed that our pods were going down because of high memory consumption. As explained above, we have no aggregations and do not persist the dataset. What we noticed is that the heap memory keeps growing constantly. Any idea?
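For reference, this is roughly what the pipeline looks like. This is a minimal sketch in Scala, assuming a JSON string payload; the broker address, topic names, checkpoint path, and the `enrich` UDF are placeholders, not our actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

object KafkaToKafkaStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-flatten-stream")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical UDF standing in for the column-creation logic
    val enrich = udf((value: String) => value.toUpperCase)

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "source-topic")              // placeholder
      .load()

    // Flatten / add columns, then re-serialize the row as JSON for Kafka
    val transformed = input
      .selectExpr("CAST(value AS STRING) AS value")
      .withColumn("enriched", enrich($"value"))
      .selectExpr("to_json(struct(*)) AS value")

    val query = transformed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
      .option("topic", "target-topic")                   // placeholder
      .option("checkpointLocation", "/tmp/checkpoints")  // placeholder
      .trigger(Trigger.ProcessingTime("2 minutes"))      // two-minute processing-time trigger
      .start()

    query.awaitTermination()
  }
}
```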

1 Answer


We found the solution to the issue; it took a while. Apparently Spark holds on to the objects backing the SQL UI, and this collection was growing constantly even though we had configured Spark to run with spark.ui.enabled: false. The solution was to limit it using the configuration parameter spark.sql.ui.retainedExecutions. We reproduced the memory issue easily since our dataset has around 300 columns, so the retained SQL UI data was very large.
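A minimal sketch of how the cap can be set when building the session. The limit value and app name are illustrative, not the exact values we used:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-flatten-stream")                 // placeholder
  // The UI was already disabled, but the SQL execution metadata was still retained in memory
  .config("spark.ui.enabled", "false")
  // Cap how many SQL executions Spark keeps around for the UI (default is 1000)
  .config("spark.sql.ui.retainedExecutions", "10") // illustrative value
  .getOrCreate()
```

The same setting can be passed on the command line, e.g. `--conf spark.sql.ui.retainedExecutions=10` with spark-submit.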