My Flink job frequently runs out of memory (OOM) on one task manager or another. I have enough memory and storage for the job (2 JobManagers and 16 TaskManagers, each with 15 cores and 63 GB RAM). Sometimes the job runs for 4 days before throwing an OOM error, and sometimes it fails after 2 days, even though the traffic is consistent with previous days.
I have received a suggestion not to pass objects through the streaming pipeline and to use primitives instead, to reduce shuffling overhead and memory consumption.
The Flink job I work on is written in Java. Let's say my pipeline is as follows:
Kafka source
deserialize (converts bytes to a Java object; the object contains String, int, and long fields)
FirstKeyedWindow (receives the Java objects deserialized above)
reduce
SecondKeyedWindow (receives the reduced Java objects from above)
reduce
Kafka sink (the Java objects above are serialized into bytes and produced to Kafka)
My question is: what should I consider to reduce the overhead and memory consumption? Will replacing String with a char array help reduce the overhead a bit? Or should I deal only with bytes throughout the pipeline? If I serialize the object between the keyed windows, will that help reduce the overhead? If I then have to read the bytes back, I need to deserialize them, use them as required, and serialize them again. Wouldn't that create even more serialization/deserialization overhead?
I appreciate your suggestions. Heads-up: I am talking about 10 TB of data received per day.
Update 1:
The exception I see when the OOM occurs is:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'host/host:port'. This might indicate that the remote task manager was lost.
In answer to David Anderson's comments below: The Flink version is 1.11. The state backend is RocksDB, file system based. The job is running out of heap memory. Each message from the Kafka source is up to 300 bytes in size. The first reduce function does deduplication (removes duplicates within the same group); the second reduce function does aggregation (updates the count within the object).
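To make the two reduce steps concrete, here is a minimal plain-Java sketch of what they do, with no Flink dependencies. The Event class, its fields (id, category, count), and the reduceByKey helper are hypothetical names made up for illustration; the real job uses keyed windows rather than an in-memory map, and the real object has different fields.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

class ReduceSketch {
    // Hypothetical record type; the real object has String, int and long fields.
    static final class Event {
        final String id;       // identifies a message; equal ids are duplicates
        final String category; // grouping key for the aggregation step
        final long count;

        Event(String id, String category, long count) {
            this.id = id;
            this.category = category;
            this.count = count;
        }
    }

    // First reduce: keep one event per key and drop the duplicate.
    static final BinaryOperator<Event> DEDUP = (first, duplicate) -> first;

    // Second reduce: add up the counts of the events that share a key.
    static final BinaryOperator<Event> AGGREGATE =
            (a, b) -> new Event(a.id, a.category, a.count + b.count);

    // Stand-in for a keyed window followed by reduce, applied to one closed window.
    static Map<String, Event> reduceByKey(Iterable<Event> events,
                                          Function<Event, String> keySelector,
                                          BinaryOperator<Event> reducer) {
        Map<String, Event> result = new LinkedHashMap<>();
        for (Event e : events) {
            // merge() stores e if the key is new, otherwise applies the reducer.
            result.merge(keySelector.apply(e), e, reducer);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> input = Arrays.asList(
                new Event("a", "x", 1), new Event("a", "x", 1),
                new Event("b", "x", 1), new Event("c", "y", 1));
        Map<String, Event> deduplicated = reduceByKey(input, e -> e.id, DEDUP);
        Map<String, Event> aggregated =
                reduceByKey(deduplicated.values(), e -> e.category, AGGREGATE);
        aggregated.forEach((key, event) -> System.out.println(key + " -> " + event.count));
    }
}
```

Both reducers keep only one object per key at any time, so in this simplified picture the state held per window grows with the number of distinct keys, not with the number of messages.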
Update 2:
After thorough exploration, I found that Flink uses the default Kryo serializer, which is inefficient. I understand that defining custom serializers, instead of relying on the Kryo default, can help reduce the overhead. I am now trying out Google Protobuf to see if it performs any better.
I am also planning to increase taskmanager.network.memory.fraction to suit my job's parallelism. I have yet to work out the right calculation for setting this option.
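To illustrate why a hand-written serializer can be smaller than a generic one, here is a rough plain-Java sketch. It is not Kryo and not Flink's serializer API: Java's built-in ObjectOutputStream is used purely as a stand-in for a generic serializer that writes type metadata, and DataOutputStream stands in for a custom field-by-field encoding. The Event class and its fields are hypothetical, and the byte counts say nothing about what Kryo or Protobuf would actually produce.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializerSketch {
    // Hypothetical record with the field types mentioned in the question.
    static final class Event implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        final int count;
        final long timestamp;

        Event(String id, int count, long timestamp) {
            this.id = id;
            this.count = count;
            this.timestamp = timestamp;
        }
    }

    // Generic path (stand-in only): writes class metadata along with the field values.
    static byte[] genericSerialize(Event e) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(e);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Custom path: writes only the field values in a fixed, known order.
    static byte[] customSerialize(Event e) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(e.id);
            out.writeInt(e.count);
            out.writeLong(e.timestamp);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Reads the fields back in the same order they were written.
    static Event customDeserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String id = in.readUTF();
            int count = in.readInt();
            long timestamp = in.readLong();
            return new Event(id, count, timestamp);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Event event = new Event("user-42", 3, 1600000000000L);
        System.out.println("generic bytes: " + genericSerialize(event).length);
        System.out.println("custom bytes:  " + customSerialize(event).length);
    }
}
```

The trade-off is that the custom encoding has no schema information in it, so any change to the fields has to be handled by hand, which is part of what I hope Protobuf will take care of.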
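As a starting point for that calculation, here is a rough sketch based on a rule of thumb I remember from older Flink documentation: number of network buffers = slots per TaskManager squared, times the number of TaskManagers, times 4. Everything numeric here is an assumption to be checked against the documentation for my version: a 32 KiB buffer size, bounds of 64 MiB and 1 GiB on the network memory, one slot per core (15 slots), and a hypothetical 50 GiB of total Flink memory per TaskManager. As far as I know, in Flink 1.10 and later the option is named taskmanager.memory.network.fraction and the result is clamped by the matching min and max options.

```java
class NetworkMemorySketch {
    // Assumed default size of one network buffer (memory segment): 32 KiB.
    static final long SEGMENT_SIZE_BYTES = 32L * 1024;

    // Rule of thumb from older Flink docs: slotsPerTm^2 * numTaskManagers * 4.
    static long estimateBuffers(int slotsPerTm, int numTaskManagers) {
        return (long) slotsPerTm * slotsPerTm * numTaskManagers * 4;
    }

    // Memory needed to hold the estimated number of buffers.
    static long estimateBytes(int slotsPerTm, int numTaskManagers) {
        return estimateBuffers(slotsPerTm, numTaskManagers) * SEGMENT_SIZE_BYTES;
    }

    // Network memory derived from a fraction of total Flink memory,
    // clamped to the configured minimum and maximum (assumed behaviour).
    static long effectiveNetworkBytes(long totalFlinkBytes, double fraction,
                                      long minBytes, long maxBytes) {
        long derived = (long) (totalFlinkBytes * fraction);
        return Math.max(minBytes, Math.min(maxBytes, derived));
    }

    public static void main(String[] args) {
        int slotsPerTm = 15;      // assumption: one slot per core
        int numTaskManagers = 16; // from the cluster described above
        long needed = estimateBytes(slotsPerTm, numTaskManagers);

        long totalFlinkBytes = 50L << 30; // hypothetical 50 GiB per TaskManager
        long minBytes = 64L << 20;        // assumed default minimum, 64 MiB
        long maxBytes = 1L << 30;         // assumed default maximum, 1 GiB
        long available = effectiveNetworkBytes(totalFlinkBytes, 0.1, minBytes, maxBytes);

        System.out.println("estimated need (MiB): " + needed / (1024 * 1024));
        System.out.println("available (MiB):      " + available / (1024 * 1024));
    }
}
```

If these assumptions hold, raising the fraction alone would have no effect once the derived value already exceeds the maximum, so the max option would need to be raised as well.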