I am trying to estimate the end-to-end tuple latency of my events using the latency metrics exported by Flink (I am using the Prometheus metrics reporter). Everything works, and I can see the latency metric in my Grafana/Prometheus dashboard. It looks something like this:
```
flink_taskmanager_job_latency_source_id_source_subtask_index_operator_id_operator_subtask_index_latency{
host="",instance="",job="",
job_id="",job_name="",operator_id="",operator_subtask_index="0",
quantile="0.99",source_id="",source_subtask_index="0",tm_id=""}
```
My test job is a simple source -> map -> sink pipeline with parallelism set to 1. The Flink dashboard shows that all three operators are chained together into one task. For one run of the job, I see two sets of latency metrics. Each set contains all the quantiles (0.5, 0.95, ...), and the only label that differs between the two sets is `operator_id`. I assumed this means one `operator_id` belongs to the map operator and the other belongs to the sink.
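A minimal Python sketch of the grouping described above: it parses Prometheus text-format sample lines and groups the latency quantiles by `operator_id`, which yields one set per operator. The scrape lines, the IDs `op-a`/`op-b`, and the values are hypothetical placeholders, and the regex parser is a simplification that only handles lines of the form `name{labels} value [timestamp]`, not a full implementation of the Prometheus exposition format.

```python
import re
from collections import defaultdict

# Latency metric name, copied from the metric listing above
LATENCY_METRIC = ("flink_taskmanager_job_latency_source_id_source_subtask_index"
                  "_operator_id_operator_subtask_index_latency")

# One sample line: name{labels} value [optional timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+\S+)?$')
# One label pair; permits backslash-escaped characters inside the quoted value
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def latency_by_operator(lines):
    """Group latency quantiles as {operator_id: {quantile: value}}."""
    groups = defaultdict(dict)
    for line in lines:
        match = SAMPLE_RE.match(line.strip())
        if match is None or match.group("name") != LATENCY_METRIC:
            continue  # skip comments, blank lines and any other metric
        labels = dict(LABEL_RE.findall(match.group("labels")))
        groups[labels["operator_id"]][labels["quantile"]] = float(match.group("value"))
    return dict(groups)


# Hypothetical scrape: two operator_ids, otherwise identical labels
scrape = [
    f'# TYPE {LATENCY_METRIC} summary',
    f'{LATENCY_METRIC}{{operator_id="op-a",quantile="0.5",source_id="src"}} 10.0',
    f'{LATENCY_METRIC}{{operator_id="op-a",quantile="0.99",source_id="src"}} 15.0',
    f'{LATENCY_METRIC}{{operator_id="op-b",quantile="0.5",source_id="src"}} 14.0',
    f'{LATENCY_METRIC}{{operator_id="op-b",quantile="0.99",source_id="src"}} 21.0',
]

groups = latency_by_operator(scrape)
for operator_id, quantiles in sorted(groups.items()):
    print(operator_id, quantiles)
# op-a {'0.5': 10.0, '0.99': 15.0}
# op-b {'0.5': 14.0, '0.99': 21.0}
```

Matching the full metric name exactly (rather than a prefix) also keeps any `_count`/`_sum` companion series out of the quantile sets.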
My problem is that there is no intuitive way to tell the two apart (that is, to find out which `operator_id` is the map and which is the sink) just by looking at the metrics. So my questions are essentially:
- Is my assumption correct?
- What is the best way to distinguish the two operators? I tried assigning names to my `map` and `sink`. Even though these names show up in other metrics like `numRecordsIn`, they do not show up in the latency metric.
- Is there a way to get the mapping between `operator_id` and `operator_name`?
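One possible workaround, sketched in Python under assumptions: the Prometheus reporter exports Flink's scope variables as labels, so operator-scoped metrics such as `numRecordsIn` should carry both `operator_id` and `operator_name`, and joining on `operator_id` could then attach names to the latency series. This assumes the `operator_id` on the latency metric is the same ID used in the operator scope, which is not verified here. The `result` list mimics the `data.result` array of a Prometheus `/api/v1/query` response; the IDs `op-a`/`op-b`, the names `my_map`/`my_sink`, the values, and the helper `series` are hypothetical, and `flink_taskmanager_job_task_operator_numRecordsIn` is the name the reporter would be expected to produce for that metric, not one observed in this job.

```python
# Latency metric name, copied from the metric listing above
LATENCY_METRIC = ("flink_taskmanager_job_latency_source_id_source_subtask_index"
                  "_operator_id_operator_subtask_index_latency")
# Assumed name of an operator-scoped metric that carries operator_name
RECORDS_IN = "flink_taskmanager_job_task_operator_numRecordsIn"


def series(name, value, **labels):
    """Hypothetical helper: one item shaped like data.result in an instant-query response."""
    return {"metric": {"__name__": name, **labels}, "value": [1700000000, value]}


# Hypothetical query result: names, IDs and values are placeholders
result = [
    series(RECORDS_IN, "42", operator_id="op-a", operator_name="my_map"),
    series(RECORDS_IN, "42", operator_id="op-b", operator_name="my_sink"),
    series(LATENCY_METRIC, "15.0", operator_id="op-a", quantile="0.99"),
    series(LATENCY_METRIC, "21.0", operator_id="op-b", quantile="0.99"),
]


def operator_name_map(result):
    """Build operator_id -> operator_name from series that carry both labels."""
    names = {}
    for item in result:
        labels = item["metric"]
        if "operator_id" in labels and "operator_name" in labels:
            names[labels["operator_id"]] = labels["operator_name"]
    return names


def named_latencies(result, names):
    """Return (operator_name, quantile, value) per latency series; unknown IDs stay as-is."""
    rows = []
    for item in result:
        labels = item["metric"]
        if labels["__name__"] == LATENCY_METRIC:
            op_id = labels["operator_id"]
            rows.append((names.get(op_id, op_id), labels["quantile"], float(item["value"][1])))
    return rows


names = operator_name_map(result)
rows = named_latencies(result, names)
for row in rows:
    print(row)
# ('my_map', '0.99', 15.0)
# ('my_sink', '0.99', 21.0)
```

Falling back to the raw `operator_id` keeps a latency series visible even when no operator-scoped series with a matching ID was scraped.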