We have a Flink job with roughly 30 operators. When we run this job with a parallelism of 12, Flink emits about 400,000 metrics in total, which is more than our metrics platform can handle well.
Looking at the kinds of metrics being reported, this does not seem to be a bug.
It's just that with lots of operators spread over many TaskManagers and task slots, the same metrics are repeated often enough to reach 400,000 (maybe job restarts also multiply the number of metrics?).
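To make the multiplication concrete, here is a rough back-of-the-envelope sketch. The figure of 50 metrics per operator subtask and the `tm_generations` factor are hypothetical; the latter models my unverified guess that restarts which bring up TaskManagers with new `<tm_id>` values leave extra series behind.

```python
def estimate_series(operators, parallelism, metrics_per_operator, tm_generations=1):
    """Rough count of operator-scoped series.

    Each operator runs `parallelism` subtasks, and every subtask reports its
    own copy of each metric. `tm_generations` is a hypothetical multiplier for
    series left behind under old <tm_id> values.
    """
    return operators * parallelism * metrics_per_operator * tm_generations


# 30 operators and parallelism 12 come from our job; 50 is an assumed number.
per_run = estimate_series(30, 12, 50)
after_restarts = estimate_series(30, 12, 50, tm_generations=5)

print(per_run)         # 18000
print(after_restarts)  # 90000
```

This only covers operator-scoped metrics; task, TaskManager and JobManager metrics come on top of it.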
This is the config I use for our metrics:
metrics.reporters: graphite
metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.graphite.host: some-host.com
metrics.reporter.graphite.port: 2003
metrics.reporter.graphite.protocol: TCP
metrics.reporter.graphite.interval: 60 SECONDS
metrics.scope.jm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager
metrics.scope.jm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager.<job_name>
metrics.scope.tm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>
metrics.scope.tm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>
metrics.scope.task: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<task_id>.<subtask_index>
metrics.scope.operator: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<operator_id>.<subtask_index>
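As an illustration of how the operator scope above turns into distinct Graphite paths, here is a minimal sketch that substitutes values into the scope string. It is plain string replacement, not Flink's actual implementation, and the host, TaskManager ID, job name and operator IDs are made-up placeholders.

```python
import re

OPERATOR_SCOPE = (
    "applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager."
    "<tm_id>.<job_name>.<operator_id>.<subtask_index>"
)


def expand(scope, variables, metric_name):
    """Replace each <variable> in the scope and append the metric name."""
    path = re.sub(
        r"<(\w+)>",
        # Unknown variables are left untouched.
        lambda m: str(variables.get(m.group(1), m.group(0))),
        scope,
    )
    return f"{path}.{metric_name}"


# One metric, 30 hypothetical operators, 12 subtasks each, one TaskManager.
names = {
    expand(
        OPERATOR_SCOPE,
        {
            "host": "host-1",
            "tm_id": "tm-a",
            "job_name": "my-job",
            "operator_id": f"op{op}",
            "subtask_index": subtask,
        },
        "numRecordsIn",
    )
    for op in range(30)
    for subtask in range(12)
}

print(len(names))  # 360 distinct series for a single metric name
```

Every additional value of `<host>` or `<tm_id>` under which the same subtask reports adds another full set of paths.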
Since we don't need all 400,000 of them, is it possible to control which metrics are exposed?