We are using Flink streaming to run a few jobs on a single cluster. Our jobs are using rocksDB to hold a state. The cluster is configured to run with a single Jobmanager and 3 Taskmanager on 3 separate VMs. Each TM is configured to run with 14GB of RAM. JM is configured to run with 1GB.
We are experiencing 2 memory related issues: - When running Taskmanager with 8GB heap allocation, the TM ran out of heap memory and we got heap out of memory exception. Our solution to this problem was increasing heap size to 14GB. Seems like this configuration solved the issue, as we no longer crash due to out of heap memory. - Still, after increasing heap size to 14GB (per TM process) OS runs out of memory and kills the TM process. RES memory is rising over time and reaching ~20GB per TM process.
1. The question is how can we predict the maximal total amount of physical memory and heap size configuration?
2. Due to our memory issues, is it reasonable to use a non default values of Flink managed memory? what will be the guideline in such case?
Further details: Each Vm is configured with 4 CPUs and 24GB of RAM Using Flink version: 1.3.2
Raw Statein some operator: ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/…. Can you identify in which operator the problem appears and show some code? - TobiSH