
We have a Flink job with roughly 30 operators. When we run this job with a parallelism of 12, Flink outputs about 400,000 metrics in total, which is more than our metrics platform can handle well.

Looking at the kinds of metrics being reported, this does not appear to be a bug or anything like that.

It's just that with this many operators spread across many TaskManagers and task slots, the metrics get duplicated often enough to reach the 400,000 (maybe job restarts also multiply the number of metrics?).
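
As a rough back-of-envelope illustration of how the counts multiply (the number of metrics per operator subtask below is an assumption, not something we measured):

# Illustrative sketch only; metrics_per_subtask is an assumed figure.
operators = 30            # operators in the job
parallelism = 12          # subtasks per operator
metrics_per_subtask = 10  # assumed count of metrics each operator subtask exposes

operator_scope_series = operators * parallelism * metrics_per_subtask
print(operator_scope_series)  # 3600 operator-scope series, before task-, TaskManager-
                              # and JobManager-scope metrics are added on top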

This is the config we use for our metrics:

metrics.reporters: graphite
metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.graphite.host: some-host.com
metrics.reporter.graphite.port: 2003
metrics.reporter.graphite.protocol: TCP
metrics.reporter.graphite.interval: 60 SECONDS
metrics.scope.jm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager
metrics.scope.jm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager.<job_name>
metrics.scope.tm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>
metrics.scope.tm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>
metrics.scope.task: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<task_id>.<subtask_index>
metrics.scope.operator: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<operator_id>.<subtask_index>

Since we don't need all 400,000 of them, is it possible to influence which metrics are exposed?


1 Answer


You are probably experiencing the cardinality explosion of latency metrics present in some versions of Flink, wherein latencies are tracked from each source subtask to each operator subtask. This was addressed in Flink 1.7. See https://issues.apache.org/jira/browse/FLINK-10484 and https://issues.apache.org/jira/browse/FLINK-10243 for details.
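
To get a feel for the scale with your job (30 operators at parallelism 12), here is a back-of-envelope sketch; it assumes a single source operator at parallelism 12 and roughly a dozen statistics exported per histogram by the Graphite/Dropwizard-style reporter, so treat the numbers as illustrative:

# Rough sketch of the pre-1.7 latency metric cardinality; all numbers are assumptions.
source_subtasks = 12          # one source operator at parallelism 12 (assumption)
operator_subtasks = 30 * 12   # 30 operators, 12 subtasks each
stats_per_histogram = 11      # approx. values a histogram expands to in Graphite

latency_series = source_subtasks * operator_subtasks * stats_per_histogram
print(latency_series)         # 47520 series from latency tracking alone under these
                              # assumptions; additional sources multiply this further

That alone, on top of the regular operator, task, and TaskManager metrics, can push the total well into the hundreds of thousands.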

For a quick fix, you could try disabling latency tracking by setting metrics.latency.interval to 0.
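
In flink-conf.yaml that would look something like the line below, next to the reporter settings you already have (the interval is in milliseconds, and 0 disables latency tracking; double-check against the docs for your Flink version):

metrics.latency.interval: 0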