3
votes

I am trying to estimate the end-to-end tuple latency of my events using the latency metrics exported by Flink (I am using the Prometheus metrics reporter). This works, and I can see the latency metric in my Grafana/Prometheus dashboard. It looks something like this:

flink_taskmanager_job_latency_source_id_source_subtask_index_operator_id_operator_subtask_index_latency{
  host="",instance="",job="",
  job_id="",job_name="",operator_id="",operator_subtask_index="0",
  quantile="0.99",source_id="",source_subtask_index="0",tm_id=""}

This test job is a simple source->map->sink pipeline with parallelism set to 1. I can see from the Flink dashboard that all of the operators get chained together into one task. For one run of my job, I see two sets of latency metrics. Each set shows all the quantiles (0.5, 0.95, ...). The only thing that differs between the two sets is the operator_id. I assumed this means one operator_id belongs to the map operator and the other belongs to the sink.

My problem is that there is no intuitive way to distinguish between the two (that is, to find out which operator_id is the map and which is the sink) just by looking at the metrics. So my questions are essentially:

  1. Is my assumption correct?
  2. What is the best way to distinguish the two operators? I tried assigning names to my map and sink. Even though these names show up in other metrics like numRecordsIn, they do not show up in the latency metric.
  3. Is there a way to get the mapping between operator_id and operator_name?
1
May I know how you export task-related metrics to Prometheus? By default it seems to export only JobManager-related metrics. What else needs to be configured? – YuFeng Shen

1 Answer

2
votes

The operator_id is currently a hash value. It is either computed from the hash values of the operator's inputs and the node itself or, if you have set a UID for an operator via uid, computed as the murmur3_128 hash of that UID.
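If you do set explicit UIDs, you can in principle recompute the IDs yourself and build the mapping offline. The following is only a rough sketch of that idea in pure Python, under several assumptions that are not confirmed above: that the hash is the x64 variant of MurmurHash3 with seed 0, that the UID is hashed as UTF-8, and that the operator_id label is the hex encoding of the 16 resulting bytes (both 64-bit halves written little-endian). The UIDs "my-map" and "my-sink" are hypothetical. Verify the output against the operator_id values you actually see before relying on it.

```python
MASK64 = (1 << 64) - 1
C1 = 0x87C37B91114253D5
C2 = 0x4CF5AD432745937F


def _rotl64(x, r):
    # Rotate a 64-bit value left by r bits.
    return ((x << r) | (x >> (64 - r))) & MASK64


def _fmix64(k):
    # Final avalanche step of MurmurHash3.
    k ^= k >> 33
    k = (k * 0xFF51AFD7ED558CCD) & MASK64
    k ^= k >> 33
    k = (k * 0xC4CEB9FE1A85EC53) & MASK64
    k ^= k >> 33
    return k


def _mix_k1(k1):
    k1 = (k1 * C1) & MASK64
    k1 = _rotl64(k1, 31)
    return (k1 * C2) & MASK64


def _mix_k2(k2):
    k2 = (k2 * C2) & MASK64
    k2 = _rotl64(k2, 33)
    return (k2 * C1) & MASK64


def murmur3_x64_128(data: bytes, seed: int = 0) -> bytes:
    """MurmurHash3 x64 128-bit; returns h1 then h2, each little-endian."""
    h1 = h2 = seed & MASK64
    n_blocks = len(data) // 16
    for i in range(n_blocks):
        k1 = int.from_bytes(data[16 * i:16 * i + 8], "little")
        k2 = int.from_bytes(data[16 * i + 8:16 * i + 16], "little")
        h1 ^= _mix_k1(k1)
        h1 = (_rotl64(h1, 27) + h2) & MASK64
        h1 = (h1 * 5 + 0x52DCE729) & MASK64
        h2 ^= _mix_k2(k2)
        h2 = (_rotl64(h2, 31) + h1) & MASK64
        h2 = (h2 * 5 + 0x38495AB5) & MASK64
    # Remaining 0..15 bytes that did not fill a whole 16-byte block.
    tail = data[16 * n_blocks:]
    if len(tail) > 8:
        h2 ^= _mix_k2(int.from_bytes(tail[8:], "little"))
    if tail:
        h1 ^= _mix_k1(int.from_bytes(tail[:8], "little"))
    h1 ^= len(data)
    h2 ^= len(data)
    h1 = (h1 + h2) & MASK64
    h2 = (h2 + h1) & MASK64
    h1 = _fmix64(h1)
    h2 = _fmix64(h2)
    h1 = (h1 + h2) & MASK64
    h2 = (h2 + h1) & MASK64
    return h1.to_bytes(8, "little") + h2.to_bytes(8, "little")


def operator_id_for_uid(uid: str) -> str:
    # Assumption: the metric label is the hex string of the 16 hash bytes.
    return murmur3_x64_128(uid.encode("utf-8"), seed=0).hex()


# Hypothetical UIDs; replace with the ones passed to uid(...) in your job.
uids = ["my-map", "my-sink"]
id_to_uid = {operator_id_for_uid(u): u for u in uids}

if __name__ == "__main__":
    for op_id, uid in id_to_uid.items():
        print(op_id, "->", uid)
```

With such a table you could look up each operator_id label from the latency metric and recover the UID you assigned. This only helps for operators with an explicit UID; the auto-generated IDs depend on the job graph and cannot be reproduced this way.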

The latency metric does not currently expose the operator name, so please open a JIRA issue to add this feature to Flink.