
We are running Flink on a 3-VM cluster. Each VM has about 40 GB of RAM. Each day we stop some jobs and start new ones. After a few days, starting a new job is rejected with a "Cannot allocate memory" error:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000340000000, 12884901888, 0) failed; error='Cannot allocate memory' (errno=12)

Investigation shows that the TaskManager's RAM usage keeps growing until it exceeds the 40 GB available on the VM, even though the jobs have been cancelled.
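As a stopgap while investigating (a sketch only; the key names assume Flink's 1.10+ memory model, and the 24g value is just an example), the TaskManager's overall budget can be capped in conf/flink-conf.yaml so that it cannot grow to the point where a newly started job's JVM has nothing left to reserve:

    # conf/flink-conf.yaml -- sketch, key names assume Flink 1.10+
    # Cap the whole TaskManager process (heap + managed memory + overhead)
    # well below the VM's 40 GB so new submissions can still commit memory.
    taskmanager.memory.process.size: 24g

    # On Flink <= 1.9 the closest equivalent is the heap size:
    # taskmanager.heap.size: 20g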

I don't have access to the cluster yet, so I ran some tests on a standalone cluster on my laptop and monitored the TaskManager's RAM:

  • With jvisualvm, everything works as intended: I run the job and watch the heap fill up, then cancel it and wait a few minutes for the GC to kick in, and the heap is released.
  • With top, however, the memory is high and stays high even after the job is cancelled (a small illustration of why follows below).
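The gap between the two views can be reproduced with a tiny program, a sketch in plain Java with no Flink involved: after a GC the heap's "used" value drops, but the "committed" value, which is roughly what top reports as resident memory, usually stays near its peak because the JVM keeps the pages it already obtained from the OS.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapVsCommitted {

        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

            // Allocate ~1 GB on the heap, similar to a job loading its data.
            byte[][] blocks = new byte[1024][];
            for (int i = 0; i < blocks.length; i++) {
                blocks[i] = new byte[1024 * 1024];
            }
            print("after allocation", memory.getHeapMemoryUsage());

            // Drop the references and ask for a GC, like cancelling the job.
            blocks = null;
            System.gc();
            Thread.sleep(2000);

            // "used" shrinks; "committed" (what top sees) typically does not.
            print("after GC", memory.getHeapMemoryUsage());
        }

        private static void print(String label, MemoryUsage usage) {
            System.out.printf("%-17s used=%d MB, committed=%d MB%n",
                    label, usage.getUsed() >> 20, usage.getCommitted() >> 20);
        }
    }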


At the moment we restart the cluster every morning to work around this memory issue, but we can't afford that anymore since we will need jobs running 24/7.

I'm pretty sure it's not a Flink issue, but can someone point me in the right direction about what we're doing wrong here?

Comment from xerx593: you have to gain access! :-)

1 Answer


In standalone mode, Flink may not release resources the way you expect. For example, resources held by a static member are not released when a job is cancelled, because all jobs share the same long-lived TaskManager JVM.
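For example, a hypothetical user function like the following (an illustration, not your actual code) keeps growing the TaskManager's heap across job submissions, because the static cache belongs to the TaskManager JVM rather than to the job, and it survives cancellation unless the class itself is unloaded:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.flink.api.common.functions.RichMapFunction;

    // Hypothetical example of a leak through a static member.
    public class EnrichingMapper extends RichMapFunction<String, String> {

        // Lives in the TaskManager JVM: cancelling the job does not clear it,
        // and nothing ever removes entries from it.
        private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

        @Override
        public String map(String key) {
            return CACHE.computeIfAbsent(key, EnrichingMapper::expensiveLookup);
        }

        private static String expensiveLookup(String key) {
            return key.toUpperCase(); // placeholder for a real lookup
        }
    }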

It is highly recommended to use YARN or Kubernetes as the runtime environment, so that a finished job's resources are reclaimed along with its containers.
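As a sketch (the exact flags depend on your Flink version), a per-job deployment on YARN gives every job its own TaskManager containers, and YARN reclaims them, memory included, as soon as the job finishes:

    # Older releases: the -m yarn-cluster shortcut starts a dedicated cluster per job
    ./bin/flink run -m yarn-cluster ./path/to/your-job.jar

    # Flink 1.11+ expresses the same thing as a deployment target
    ./bin/flink run -t yarn-per-job ./path/to/your-job.jar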