I am trying to estimate the end-to-end tuple latency of my events using the latency metrics exported by Flink (I am using a Prometheus metrics reporter). Everything works, and I can see the latency metric in my Grafana/Prometheus dashboard. It looks something like this:
flink_taskmanager_job_latency_source_id_source_subtask_index_operator_id_operator_subtask_index_latency{
host="",instance="",job="",
job_id="",job_name="",operator_id="",operator_subtask_index="0",
quantile="0.99",source_id="",source_subtask_index="0",tm_id=""}
My test job is a simple source->map->sink pipeline with parallelism set to 1. I can see from the Flink dashboard that all of the operators get chained together into one task. For one run of my job, I see two sets of latency metrics. Each set shows all quantiles (0.5, 0.95, ...). The only difference between the two sets is the operator_id. I assumed this means one operator_id belongs to the map operator and the other belongs to the sink.
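To make the "two sets" concrete, here is a minimal sketch of how the series can be grouped by operator_id from the Prometheus text output. The sample lines, the shortened metric name `latency`, the operator_id values, and the numbers are all made up for illustration; they are not output from my job.

```python
import re
from collections import defaultdict

# Hypothetical sample in Prometheus text exposition format.
# The metric name is shortened to "latency"; the operator_id values
# and the numbers are invented for illustration only.
SAMPLE = '''\
latency{operator_id="aaa111",operator_subtask_index="0",quantile="0.5",source_id="src",source_subtask_index="0"} 1.0
latency{operator_id="aaa111",operator_subtask_index="0",quantile="0.99",source_id="src",source_subtask_index="0"} 4.0
latency{operator_id="bbb222",operator_subtask_index="0",quantile="0.5",source_id="src",source_subtask_index="0"} 2.0
latency{operator_id="bbb222",operator_subtask_index="0",quantile="0.99",source_id="src",source_subtask_index="0"} 7.0
'''

# metric_name{label="value",...} sample_value
LINE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def group_by_operator(text):
    """Return {operator_id: {quantile: value}} for every labelled sample line."""
    groups = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines without a label set
        labels = dict(LABEL_RE.findall(match.group('labels')))
        if 'operator_id' not in labels or 'quantile' not in labels:
            continue  # not a latency quantile series
        groups[labels['operator_id']][labels['quantile']] = float(match.group('value'))
    return dict(groups)


if __name__ == '__main__':
    for operator_id, quantiles in group_by_operator(SAMPLE).items():
        print(operator_id, quantiles)
```

This yields one group per operator_id, which is exactly where I get stuck: nothing in either group says whether it is the map or the sink.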
Now my problem is that there is no intuitive way to distinguish between the two (that is, to find out which operator_id is the map and which is the sink) just by looking at the metrics. So my questions are essentially:
- Is my assumption correct?
- What is the best way to distinguish the two operators? I tried assigning names to my map and sink. Even though these names show up in other metrics like numRecordsIn, they do not show up in the latency metric.
- Is there a way to get the mapping between operator_id and operator_name?