
I'm trying to set up Google Cloud Composer monitoring via Terraform. This is my "hello world" code (it works, but it does not fulfill my acceptance criteria):

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "3.5.0"
    }
  }
}

provider "google" {

  credentials = "some_credentials"

  project = "some_project"
  region  = "some_region"
  zone    = "some_zone"
}

resource "google_monitoring_notification_channel" "basic" {
  display_name = "Test name"
  type         = "email"
  labels = {
    email_address = "[email protected]"
  }
}

resource "google_monitoring_alert_policy" "cloud_composer_job_fail_monitor" {
  combiner              = "OR"
  display_name          = "Fails testing on cloud composer tasks"
  notification_channels = [google_monitoring_notification_channel.basic.id]
  conditions {
    display_name = "Failures count"
    condition_threshold {
      filter          = "resource.type=\"cloud_composer_workflow\" AND metric.type=\"composer.googleapis.com/workflow/task/run_count\" AND resource.label.\"project_id\"=\"some_project\" AND metric.label.\"state\"=\"failed\" AND resource.label.\"location\"=\"some_region\""
      duration        = "60s"
      comparison      = "COMPARISON_GT"
      threshold_value = 0
      aggregations {
        alignment_period   = "3600s"
        per_series_aligner = "ALIGN_COUNT"
      }
    }
  }
  documentation {
    content = "Please check out the current incident"
  }
}

Problem: By default, notifications are sent when an alerting policy is either triggered or resolved (per the Google documentation).

My question: I want to get an alert notification every 30 minutes (for example) while Cloud Composer jobs keep failing, until I or someone else resolves the incident (or I need to understand why the incident is not resolved automatically when the job stops failing).

Can someone help with this issue?

Thank you for your help!

Since you already use OR as the combiner, you should always be notified when one of the conditions is met. If it is not triggered, you may need to check the logs. - Alex G
@AlexG Thanks, I understand that, but consider this example: I create a task that raises an error, my monitor catches it, and I get a notification about the incident. After that, I delete the task so no error occurs, but the incident is not resolved automatically and I get no further notifications. So in the end I have no failing task, one still-active incident, and only one notification. - Roma D
OK, I didn't find anything about a notification every 30 minutes, but you need to change per_series_aligner = "ALIGN_COUNT" to per_series_aligner = "ALIGN_DELTA" to get notified when the job stops failing (it is also better to use alignment_period = "60s" and duration = "0s" so the condition is met faster). - Roma D
You can post that as an answer @RomaD - c69

1 Answer


The fix is to change these fields:

  • per_series_aligner
  • duration
  • alignment_period

These changes make it possible to get an alert notification about Cloud Composer tasks in the failed state, and they also make the condition trigger faster:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "3.5.0"
    }
  }
}

provider "google" {

  credentials = "some_credentials"

  project = "some_project"
  region  = "some_region"
  zone    = "some_zone"
}

resource "google_monitoring_notification_channel" "basic" {
  display_name = "Test name"
  type         = "email"
  labels = {
    email_address = "[email protected]"
  }
}

resource "google_monitoring_alert_policy" "cloud_composer_job_fail_monitor" {
  combiner              = "OR"
  display_name          = "Fails testing on cloud composer tasks"
  notification_channels = [google_monitoring_notification_channel.basic.id]
  conditions {
    display_name = "Failures count"
    condition_threshold {
      filter          = "resource.type=\"cloud_composer_workflow\" AND metric.type=\"composer.googleapis.com/workflow/task/run_count\" AND resource.label.\"project_id\"=\"some_project\" AND metric.label.\"state\"=\"failed\" AND resource.label.\"location\"=\"some_region\""
      duration        = "0s"
      comparison      = "COMPARISON_GT"
      threshold_value = 0
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_DELTA"
      }
    }
  }
  documentation {
    content = "Please check out the current incident"
  }
}

I found no information about a continuous notification (for example, once every 30 minutes) with this kind of setting.

You will be notified only when your condition is met.
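As a side note: newer versions of the hashicorp/google provider (4.x and later, so this would mean upgrading from the 3.5.0 pinned above) expose an alert_strategy block on google_monitoring_alert_policy. Assuming you can upgrade, a sketch along these lines may cover the repeat-notification requirement; the auto_close and renotify_interval values below are illustrative, not something verified against this exact setup:

resource "google_monitoring_alert_policy" "cloud_composer_job_fail_monitor" {
  combiner              = "OR"
  display_name          = "Fails testing on cloud composer tasks"
  notification_channels = [google_monitoring_notification_channel.basic.id]

  conditions {
    display_name = "Failures count"
    condition_threshold {
      filter          = "resource.type=\"cloud_composer_workflow\" AND metric.type=\"composer.googleapis.com/workflow/task/run_count\" AND resource.label.\"project_id\"=\"some_project\" AND metric.label.\"state\"=\"failed\" AND resource.label.\"location\"=\"some_region\""
      duration        = "0s"
      comparison      = "COMPARISON_GT"
      threshold_value = 0
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_DELTA"
      }
    }
  }

  alert_strategy {
    # Close open incidents automatically once the series stops reporting
    # failures for this long (illustrative value, 30 minutes).
    auto_close = "1800s"

    # Re-send notifications to this channel while the incident stays open
    # (illustrative value, every 30 minutes).
    notification_channel_strategy {
      notification_channel_names = [google_monitoring_notification_channel.basic.name]
      renotify_interval          = "1800s"
    }
  }
}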