Terraform doesn't seem to pick up manual changes

Question

I have a very frustrating Terraform issue. I made some changes to my Terraform script, and the apply failed. I have since gone through a bunch of machinations and probably made the situation worse, as I ended up manually deleting a bunch of AWS resources while trying to resolve this.
So now I am unable to use Terraform at all: refresh, plan and destroy all fail with the same error.

The Situation

I have a list of Fargate services, and a set of maps which correlate different features of the Fargate services, such as the "Target Group" for the load balancer (I've provided some code below). The problem appears to be that Terraform is not picking up that these resources have been manually deleted, or is somehow getting confused because they don't exist. At this point, if I run a refresh, plan or destroy, I get an error stating that a specific list is empty, even though it isn't (or should not be).
In the failed run I added a new service to the list below, along with a new URL (see code below).

Objective

At this point I would settle for destroying the entire environment (it's my dev environment). Ideally, however, I want to get the system working such that Terraform detects the changes and works properly.

Terraform Script is Valid

I have reverted my Terraform scripts back to the last known good version. I have run the good version against our staging environment and it works fine.

Configuration Info

MacOS Mojave 10.14.6 (18G103)

Terraform v0.12.24.

provider.archive v1.3.0

provider.aws v2.57.0

provider.random v2.2.1

provider.template v2.1.2

The Terraform state file is stored in an S3 bucket, and terraform init --reconfigure has been called.

What I've done

I was originally getting a similar error, but in a different location. After many hours of Googling and trying things (which I didn't write down), I decided to manually remove the AWS resources associated with the problematic code (the ALB, target groups and security groups).

Example Terraform Script

Unfortunately I can't post the actual script as it is private, but I've posted what I believe are the pertinent parts, with some info redacted. I mention this because any syntax error you might see would be caused by the redaction; as I stated above, the script works fine when run against our staging environment.

globalvars.tf

In the root directory. For the failed Terraform run I added a new name ("edd") as the first element of the service_names list, and a new entry (edd = "edd") as the last entry of service_name_map_2_url. I'm not sure whether adding these elements in a different order is the problem, although it really shouldn't be, since I access the map by name and not by index.

variable "service_names" {
  type = list(string)
  description = "This is a list/array of the images/services for the cluster"
  default = [
    "alert",
    "alert-config"
  ]
}

variable "service_name_map_2_url" {
  type = map(string)
  description = "This map contains the base URL used for the service"
  default = {
    alert = "alert"
    alert-config = "alert-config"
  }
}

alb.tf

In modules/alb. This module creates an ALB and then a target group for each service, as shown below. The items from globalvars.tf are passed into this script.

locals {
  numberOfServices = length(var.service_names)
}

resource "aws_alb" "orchestration_alb" {
  name = "orchestration-alb"
  subnets = var.public_subnet_ids
  security_groups = [var.alb_sg_id]

  tags = {
    environment = var.environment
    group       = var.tag_group_name
    app         = var.tag_app_name
    contact     = var.tag_contact_email
  }
}

resource "aws_alb_target_group" "orchestration_tg" {
  count = local.numberOfServices
  name = "${var.service_names[count.index]}-tg"
  port = 80
  protocol = "HTTP"
  vpc_id = var.vpc_id
  target_type = "ip"
  deregistration_delay = 60
  tags = {
    environment = var.environment
    group       = var.tag_group_name
    app         = var.tag_app_name
    contact     = var.tag_contact_email
  }
  health_check {
    path = "/${var.service_name_map_2_url[var.service_names[count.index]]}/health"
    port = var.app_port
    protocol = "HTTP"
    healthy_threshold = 2
    unhealthy_threshold = 5
    interval = 30
    timeout = 5
    matcher = "200-308"
  }
}
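
A note on ordering: because the target groups above are created with count, each one is addressed by its position in service_names, and cloudwatch.tf reads positions [0] and [1]. Adding a name at the front of the list therefore shifts every existing index. A minimal sketch of the same resource keyed by service name with for_each (available on resources since Terraform 0.12.6) is shown here. The output name target_group_arn_suffix_by_name is hypothetical, the tags block and most health check settings are left out for brevity, and this is an illustration rather than the configuration actually in use.

```hcl
# Sketch only: target groups keyed by service name instead of list position.
resource "aws_alb_target_group" "orchestration_tg" {
  # toset() turns the list into a set, so each.key is the service name
  for_each = toset(var.service_names)

  name                 = "${each.key}-tg"
  port                 = 80
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  target_type          = "ip"
  deregistration_delay = 60

  health_check {
    # The URL map is looked up by name, exactly as in the count version
    path     = "/${var.service_name_map_2_url[each.key]}/health"
    port     = var.app_port
    protocol = "HTTP"
  }
}

# Hypothetical output: a map from service name to ARN suffix,
# so consumers look up "alert" rather than index 0.
output "target_group_arn_suffix_by_name" {
  value = {
    for name, tg in aws_alb_target_group.orchestration_tg :
    name => tg.arn_suffix
  }
}
```

With this shape, adding "edd" anywhere in service_names would only add one new instance, since the existing ones keep their name-based addresses.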

output.tf

This is the output of alb.tf. Other values are output as well, but this is the one that matters for this issue.

output "target_group_arn_suffix" {
  value = aws_alb_target_group.orchestration_tg.*.arn_suffix
}

cloudwatch.tf

In modules/cloudwatch. Here I attempt to create a dashboard.

data "template_file" "Dashboard" {
  template = file("${path.module}/dashboard.json.template")
  vars = {
    ...
    alert-tg = var.target_group_arn_suffix[0]
    alert-config-tg = var.target_group_arn_suffix[1]
    edd-cluster-name = var.ecs_cluster_name
    alb-arn-suffix = var.alb-arn-suffix
  }
}

Error

When I run terraform refresh (or plan or destroy) I get the following error. I get the same error for alert-config as well.

Error: Invalid index

  on modules/cloudwatch/cloudwatch.tf line 146, in data "template_file" "Dashboard":
 146:     alert-tg = var.target_group_arn_suffix[0]
    |----------------
    | var.target_group_arn_suffix is empty list of string

The given key does not identify an element in this collection value.
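
The message means that the list handed to the cloudwatch module had no elements when the expression was evaluated, which would be the case if no target groups are recorded in state at that point. One way to keep refresh, plan and destroy from failing on the index is to guard it with a length check. The sketch below is only an illustration under assumptions: the local names are hypothetical, and the empty-string fallback is just a placeholder value for the dashboard template.

```hcl
# Sketch only: guard the positional lookups so an empty list does not
# raise "Invalid index". All local names here are hypothetical.
locals {
  tg_count = length(var.target_group_arn_suffix)

  # Fall back to an empty string when the element does not exist
  alert_tg        = local.tg_count > 0 ? var.target_group_arn_suffix[0] : ""
  alert_config_tg = local.tg_count > 1 ? var.target_group_arn_suffix[1] : ""
}
```

The vars map in the template_file data source would then reference local.alert_tg and local.alert_config_tg instead of indexing the variable directly.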

AWS Environment

I have manually deleted the ALB, the dashboard and all target groups. I would expect (and this has worked in the past) that Terraform would detect this and update its state file accordingly, so that when running a plan it would know it has to create the ALB and target groups.

Thank you

Alain O'Dea · Accepted Answer · 2020-04-14T03:42:28

Terraform trusts its state as the single source of truth. Using Terraform in the presence of manual change is possible, but problematic.

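If you manually remove infrastructure, you need to run terraform state rm [resource path] for each manually removed resource so that Terraform stops tracking it. Running terraform state list prints the resource paths currently held in the state.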

Gruntwork has what they call The Golden Rule of Terraform:

The master branch of the live repository should be a 1:1 representation of what’s actually deployed in production.