Flink gets stuck at checkpoint creation

Question

I have a Flink job that gets stuck creating checkpoints. It has almost no state (besides some Kafka offsets).

The job itself has this basic setup:

KafkaSource -> iterate -> HDFSSink

The iterate function in turn makes an HTTP call, forwards the successes, throws away 4xx responses, and retries 5xx responses. From what I can see in my metrics, all of this happens: I get some 5xx (back to the iteration source), some 4xx (ignored), and a lot of 2xx (forwarded to HDFS).

If I look at the thread dump I can see that a certain task is blocked:

"Async calls on IterationSource-8 (1/1)" #123 daemon prio=5 os_prio=0 tid=0x00007f174000f800 nid=0x237 waiting for monitor entry [0x00007f17b32f5000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:747)
    - waiting to lock <0x00000000ace0f128> (a java.lang.Object)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:683)
    at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1155)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

This one is waiting for an object monitor that is held by:

"IterationSource-8 (1/1)" #63 prio=5 os_prio=0 tid=0x00007f17c00bf000 nid=0x1e0 in Object.wait() [0x00007f17b17d2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:256)
    - locked <0x00000000acd030b0> (a java.util.ArrayDeque)
    at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:213)
    at org.apache.flink.runtime.io.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:181)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:256)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:184)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:154)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:120)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
    at org.apache.flink.streaming.runtime.tasks.StreamIterationHead.performDefaultAction(StreamIterationHead.java:77)
    - locked <0x00000000ace0f128> (a java.lang.Object)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:298)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:403)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
    at java.lang.Thread.run(Thread.java:748)

Looking closer at the source code, I can see that the second thread (holding the lock) seems to be in some kind of endless loop:

LocalBufferPool.java:

while (availableMemorySegments.isEmpty()) {
}

Dear Flink gurus, any clue which metric I should look at? I am using Flink 1.9.0.

Thanks in advance for any hint!

Anurag Anand · Accepted Answer · 2020-05-20T10:03:50

I was seeing similar stuck checkpoints when I was making HTTP calls in a Flink sink. After a lot of trial and error, I figured out that if the sink's rate per second is slower than the input rate, checkpoints get stuck.

To address this, I set a parallelism of 1 for the source (input) and a parallelism of 8 for the HTTP calls.
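A back-of-the-envelope way to arrive at a parallelism like this, assuming each subtask makes one blocking HTTP call at a time: a subtask can then process at most 1000 / latencyMillis records per second, so the HTTP stage needs roughly inputRate * latency subtasks to keep up. The sketch below is plain Java, `requiredParallelism` is a hypothetical helper (not a Flink API), and the rates and latencies are made-up numbers, not measurements from any real job.

```java
final class HttpStageSizing {

    /**
     * Rough lower bound on the parallelism of a stage that makes one
     * blocking call per record. Hypothetical helper, not part of Flink.
     *
     * @param recordsPerSecond rate at which records arrive at the stage
     * @param latencyMillis    average duration of one blocking call
     */
    static int requiredParallelism(double recordsPerSecond, double latencyMillis) {
        // Each subtask finishes (1000 / latencyMillis) calls per second, so the
        // number of calls in flight must be at least rate * latency (in seconds).
        double callsInFlight = recordsPerSecond * latencyMillis / 1000.0;
        return Math.max(1, (int) Math.ceil(callsInFlight));
    }

    public static void main(String[] args) {
        // Assumed numbers: 40 records/s arriving, 200 ms per HTTP call.
        System.out.println(requiredParallelism(40.0, 200.0));
        // Assumed numbers: 100 records/s arriving, 50 ms per HTTP call.
        System.out.println(requiredParallelism(100.0, 50.0));
    }
}
```

This is only a lower bound: retries (such as the 5xx loop in the question) add to the effective input rate, and latency spikes need extra headroom.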

With this setup, the thread is no longer blocked waiting for the HTTP response, so checkpoints can complete. I am also new to Flink and would like some guru to explain why checkpoints slow down when using HTTP calls inside Flink.
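One plausible reading of the thread dump in the question, offered as a sketch rather than a confirmed diagnosis: the task thread takes the checkpoint lock and then blocks inside it while waiting for a free output buffer, so the thread that wants to perform the checkpoint cannot get the same lock until downstream frees a buffer. The plain-Java simulation below reproduces only that locking pattern. It is not Flink code, all names are hypothetical, and a `ReentrantLock` stands in for the object monitor from the dump so that the checkpoint attempt can time out instead of hanging.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class CheckpointLockSketch {

    /** Returns true if the simulated checkpoint got the lock within timeoutMillis. */
    static boolean checkpointCanRun(int freeBuffers, long timeoutMillis) {
        ReentrantLock checkpointLock = new ReentrantLock();
        ArrayDeque<byte[]> bufferPool = new ArrayDeque<>();
        for (int i = 0; i < freeBuffers; i++) {
            bufferPool.add(new byte[32]);
        }
        CountDownLatch lockTaken = new CountDownLatch(1);

        // Simulated task thread: emits one record while holding the checkpoint lock.
        Thread task = new Thread(() -> {
            checkpointLock.lock();
            try {
                lockTaken.countDown();
                synchronized (bufferPool) {
                    // Timed wait in a loop, matching the TIMED_WAITING state in the dump.
                    while (bufferPool.isEmpty()) {
                        bufferPool.wait(100);
                    }
                    bufferPool.poll();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                checkpointLock.unlock();
            }
        });
        task.setDaemon(true);
        task.start();

        try {
            lockTaken.await(); // make sure the task thread owns the lock first
            // Simulated checkpoint trigger: needs the same lock as the task thread.
            boolean acquired = checkpointLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                checkpointLock.unlock();
            }
            task.interrupt(); // end the simulation; nothing ever refills the pool
            task.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("simulation interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("buffer available:  " + checkpointCanRun(1, 500));
        System.out.println("buffers exhausted: " + checkpointCanRun(0, 500));
    }
}
```

With a free buffer the task thread releases the lock almost immediately and the checkpoint attempt succeeds. With an empty pool the attempt times out. If this reading is right, anything that lets the slow stage drain buffers faster, such as raising its parallelism, shortens the time the lock is held.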
