
It seems that the number of slots allocated should be equal to the parallelism,
and the number of task managers should be equal to parallelism / (slots per TM).
But the application below does not behave this way.

The topology is as below: [screenshot of the job graph]

The parallelism is set to 140, with one slot per TM.
But only 115 slots are allocated: [screenshot of the Flink UI]

And the application threw an exception after several minutes: [screenshot of the exception]

The exception said: "Could not allocate all requires slots within timeout of 300000 ms. Slots required: 470, slots allocated: 388".

There are several questions here.

  1. "Slots required: 470" — the calculation of this number seems to disregard slot sharing. Why?
  2. "slots allocated: 388" — in fact, I only have 195 slots left, and the number of slots shown as allocated in the UI is 115. Why?
  3. The parallelism is set to 140, but only 115 slots are allocated. Why?

3 Answers


You need to check the slotSharingGroup configuration parameter: it looks like each operator's slotSharingGroup is different, so the job needs 470 slots. Details about the parameter: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/
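The accounting behind this answer can be sketched as follows. This is not Flink source code, just an illustration of the rule that a job needs one slot per slot-sharing group for each unit of that group's highest operator parallelism; the group names (`g1` … `g4`, `default`) and the operator list are assumptions mirroring the topology's parallelisms.

```python
def required_slots(operators):
    """operators: list of (parallelism, slot_sharing_group) tuples.

    Required slots = sum over slot-sharing groups of the highest
    operator parallelism inside each group.
    """
    max_per_group = {}
    for parallelism, group in operators:
        max_per_group[group] = max(max_per_group.get(group, 0), parallelism)
    return sum(max_per_group.values())

# Every operator in its own group: no subtasks can share a slot.
separate = [(140, "g1"), (140, "g2"), (50, "g3"), (140, "g4")]
print(required_slots(separate))  # 470, matching the exception message

# All operators in the default group: one subtask of each operator
# can share a single slot.
shared = [(140, "default"), (140, "default"), (50, "default"), (140, "default")]
print(required_slots(shared))  # 140
```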


When the scheduler says "Slots required: 470, slots allocated: 388", I believe this is saying that the scheduler needed to find slots for 470 tasks, but only succeeded in finding slots for 388 of them. 470 is the total number of tasks in your execution graph: 140 + 140 + 50 + 140.

Now to answer your question:

When slot sharing is fully enabled (which is the default), then a job needs as many slots as the degree of parallelism of the task with the highest parallelism. In your case this is 140.

But just because a job wants/needs 140 slots doesn't mean that the cluster will be able to provide that many. For some reason only 115 task managers have been started. Perhaps some resource or quota has been exhausted.
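The two numbers in this answer are plain arithmetic on the topology's parallelisms, which can be checked directly:

```python
parallelisms = [140, 140, 50, 140]  # one entry per operator in the graph

total_subtasks = sum(parallelisms)           # every subtask is one schedulable task
slots_with_full_sharing = max(parallelisms)  # default slot sharing: one slot can
                                             # hold one subtask of each operator

print(total_subtasks)           # 470 -- the "Slots required" in the message
print(slots_with_full_sharing)  # 140 -- what the job needs with full sharing
```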


With the help of my workmates, I finally found the key point.

It is indeed that some resource has been exhausted.

Though, as a whole, there is still some free memory and there are still some free cores, from a per-machine view some machines have free memory but not enough cores, while the other machines have free cores but not enough memory.

The situation is like the table below (we need 4 GB for one TM in our case): [table image: per-machine free memory and cores]

Thus, a container can be started on none of these machines, because we cannot find a single machine with both sufficient memory and sufficient cores, and the application hangs in the CREATING state.
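This fragmentation effect can be sketched with made-up per-machine numbers (the real figures were in the screenshot above; the two machines and their free resources here are purely illustrative):

```python
# Hypothetical free resources per machine (the real values were in the table).
machines = [
    {"free_mem_gb": 8, "free_cores": 0},  # memory left, but no cores
    {"free_mem_gb": 2, "free_cores": 6},  # cores left, but not enough memory
]

TM_MEM_GB, TM_CORES = 4, 1  # one TaskManager container needs 4 GB and 1 core


def can_place_tm(machine):
    """A TM container fits only if one machine has BOTH resources free."""
    return (machine["free_mem_gb"] >= TM_MEM_GB
            and machine["free_cores"] >= TM_CORES)


total_mem = sum(m["free_mem_gb"] for m in machines)    # 10 GB free overall
total_cores = sum(m["free_cores"] for m in machines)   # 6 cores free overall

# Cluster-wide totals look sufficient...
print(total_mem >= TM_MEM_GB and total_cores >= TM_CORES)  # True
# ...but no single machine satisfies both requirements at once.
print(any(can_place_tm(m) for m in machines))              # False
```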

And about the message in the exception (thrown by version 1.6.4), "Could not allocate all requires slots within timeout of 300000 ms. Slots required: 470, slots allocated: 388": considering the numbers, I think 'slot' here actually means 'subtask', not 'core'.

In version 1.11.0 the exception message changed to 'Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.' This description may be more appropriate.