2 votes

We are trying to deploy an Apache Flink job on a Kubernetes cluster, but we are noticing odd behavior: when we start our job, the task manager memory starts at the amount assigned, in our case 3 GB.

taskmanager.memory.process.size: 3g

Eventually, the memory starts decreasing until it reaches about 160 MB; at that point it recovers a little memory, so it never quite runs out.

(Images 1 and 2: graphs of the task manager memory declining over time.)

That very low memory often causes the job to be terminated due to a task manager heartbeat exception, even when just watching the logs in the Flink dashboard or while the job is doing its processing.

Why is it going so low on memory? We expected this behavior, but in the range of GB, since we assigned those 3 GB to the task manager. Even if we change the task manager memory size, we see the same behavior.

Our Flink conf looks like this:

flink-conf.yaml: |+
    taskmanager.numberOfTaskSlots: 1
    blob.server.port: 6124
    taskmanager.rpc.port: 6122
    taskmanager.memory.process.size: 3g
    metrics.reporters: prom
    metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
    metrics.reporter.prom.port: 9999
    metrics.system-resource: true
    metrics.system-resource-probing-interval: 5000
    jobmanager.rpc.address: flink-jobmanager
    jobmanager.rpc.port: 6123

Is there a recommended memory configuration for Kubernetes, or is there something we are missing in our flink-conf.yaml?

Thanks.


2 Answers

0 votes

Your configuration looks fine. It's most likely an issue with your code and some kind of memory leak. This is a very good answer describing what the problem may be.

You can try setting a limit on the JVM heap with taskmanager.memory.task.heap.size so that the JVM has some extra room to do GC, etc. But in the end, if you are allocating objects that are never released (i.e., they stay referenced), you will run into this situation.
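For example, a minimal sketch of such a limit in flink-conf.yaml (the 2g figure is an illustrative assumption, not a recommendation; Flink derives the remaining components from the total):

    taskmanager.memory.process.size: 3g
    # Cap the heap available to user code; the rest of the 3g is left for
    # framework heap, managed memory, metaspace, and JVM overhead.
    taskmanager.memory.task.heap.size: 2g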

Presumably, you are using your memory to store your state, in which case you can also try RocksDB as the state backend in case you are storing large objects.
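A minimal sketch of switching the state backend in flink-conf.yaml (the checkpoint path is a hypothetical placeholder):

    # RocksDB keeps state on local disk instead of on the JVM heap.
    state.backend: rocksdb
    state.checkpoints.dir: s3://my-bucket/flink-checkpoints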

0 votes

What are your requests/limits in your deployment templates? If no request sizes are specified, you may be seeing your cluster resources get eaten by neighboring pods.
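For instance, a minimal sketch of the task manager container spec (the container name, image tag, and values are illustrative assumptions):

    containers:
      - name: taskmanager            # hypothetical container name
        image: flink:1.13            # hypothetical image tag
        resources:
          requests:
            memory: "3Gi"            # match taskmanager.memory.process.size
            cpu: "1"
          limits:
            memory: "3Gi"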