0
votes

I'm running Flink(1.4.2) on Yarn. I'm using Flink Yarn Client for submitting the job to Yarn Cluster.

Suppose I have a TM with 4 slots and I deploy a flink job with parallelism=4 with 2 container - 1 JM and 1 TM. Each parallel instance will be deployed in one task slot each in the TM (the entire job pipeline running per slot).

My jobs do a join(SQL time-windowed join on non-keyed stream) and they buffer last 3 hours of data. As per Flink docs the separate threads running in different task slot share data sets and data structures, thus reducing the per-task overhead.

My question is will these threads running in different task slot share this data buffered for join. What all data is shared across these threads.

Edit

Sample Query -

SELECT R.order_id, S.order.restaurant_id FROM awz_s3_stream1 R INNER JOIN awz_s3_stream2 S ON CAST(R.order_id AS VARCHAR) = S.order_id AND R.proctime BETWEEN S.proctime - INTERVAL '2' HOUR AND S.proctime + INTERVAL '2' HOUR GROUP BY HOP(S.proctime, INTERVAL '2' MINUTE, INTERVAL '1' HOUR), S.order.restaurant_id

1

1 Answers

0
votes

Each Task will receive its own disjunct partition of the input data. What is shared by the Tasks running on the same TaskManager are services and control data structures like the network stack, network connections, RPC endpoints, heartbeating between distributed components etc.