Terraform resource recreation with dynamic AWS RDS instance counts

Question

I have a question relating to AWS RDS cluster and instance creation.

Environment

We recently experimented with:

Terraform v0.11.11
provider.aws v1.41.0

Background

We are creating some AWS RDS databases. Our requirement is that some environments (e.g. staging) may run fewer instances than others (e.g. production). Rather than maintain completely separate Terraform files per environment, we define the database resources once and use a variable for the number of instances, which is set in staging.tf and production.tf respectively.

One further quirk of our setup is that the VPC containing the subnets is not defined in Terraform. It was created manually in the AWS console, so it is referenced through a data source, while the subnets for RDS are specified in Terraform. The subnet count is also dynamic: some environments have 3 subnets (1 in each AZ), whereas others have only 2. To achieve this we used iteration, as shown below:

Structure

|-/environments
     -/staging
         -staging.tf
     -/production
         -production.tf
|- /resources
     - database.tf

Example Environment Variables File

dev.tf

terraform {
  backend "s3" {
    bucket         = "my-bucket-dev"
    key            = "terraform"
    region         = "eu-west-1"
    encrypt        = "true"
    acl            = "private"
    dynamodb_table = "terraform-state-locking"
  }

  required_version = "~> 0.11.8"
}

provider "aws" {
  access_key          = "${var.access_key}"
  secret_key          = "${var.secret_key}"
  region              = "${var.region}"
  version             = "~> 1.33"
  allowed_account_ids = ["XXX"]
}

module "main" {
  source                          = "../../resources"
  vpc_name                        = "test"
  test_db_name                    = "terraform-test-db-dev"
  test_db_instance_count          = 1
  test_db_backup_retention_period = 7
  test_db_backup_window           = "00:57-01:27"
  test_db_maintenance_window      = "tue:04:40-tue:05:10"
  test_db_subnet_count            = 2
  test_db_subnet_cidr_blocks      = ["10.2.4.0/24", "10.2.5.0/24"]
}

We came to this module based structure for environment isolation mainly due to these discussions:

Our Issue

Initial resource creation works fine, our subnets are created, the database cluster starts up.

Our issues start the next time we run terraform plan or terraform apply (with no changes to the files), at which point we see output like:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:
module.main.aws_rds_cluster.test_db (new resource required)
id: "terraform-test-db-dev" => (forces new resource)
availability_zones.#: "3" => "1" (forces new resource)
availability_zones.1924028850: "eu-west-1b" => "" (forces new resource)
availability_zones.3953592328: "eu-west-1a" => "eu-west-1a"
availability_zones.94988580: "eu-west-1c" => "" (forces new resource)

and

module.main.aws_rds_cluster_instance.test_db (new resource required)
id: "terraform-test-db-dev" => (forces new resource)
cluster_identifier: "terraform-test-db-dev" => "${aws_rds_cluster.test_db.id}" (forces new resource)

Something about the way we are approaching this appears to be causing terraform to believe that the resource has changed to such an extent that it must destroy the existing resource and create a brand new one.

Config

variable "aws_availability_zones" {
  description = "Run the EC2 Instances in these Availability Zones"
  type        = "list"
  default     = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
}

variable "test_db_name" {
  description = "Name of the RDS instance, must be unique per region and is provided by the module config"
}

variable "test_db_subnet_count" {
  description = "Number of subnets to create, is provided by the module config"
}

resource "aws_security_group" "test_db_service" {
  name   = "${var.test_db_service_user_name}"
  vpc_id = "${data.aws_vpc.vpc.id}"
}

resource "aws_security_group" "test_db" {
  name   = "${var.test_db_name}"
  vpc_id = "${data.aws_vpc.vpc.id}"
}

resource "aws_security_group_rule" "test_db_ingress_app_server" {
  security_group_id        = "${aws_security_group.test_db.id}"
...
  source_security_group_id = "${aws_security_group.test_db_service.id}"
}

variable "test_db_subnet_cidr_blocks" {
  description = "Cidr block allocated to the subnets"
  type        = "list"
}

resource "aws_subnet" "test_db" {
  count             = "${var.test_db_subnet_count}"
  vpc_id            = "${data.aws_vpc.vpc.id}"
  cidr_block        = "${element(var.test_db_subnet_cidr_blocks, count.index)}"
  availability_zone = "${element(var.aws_availability_zones, count.index)}"
}

resource "aws_db_subnet_group" "test_db" {
  name       = "${var.test_db_name}"
  subnet_ids = ["${aws_subnet.test_db.*.id}"]
}

variable "test_db_backup_retention_period" {
  description = "Number of days to keep the backup, is provided by the module config"
}

variable "test_db_backup_window" {
  description = "Window during which the backup is done, is provided by the module config"
}

variable "test_db_maintenance_window" {
  description = "Window during which the maintenance is done, is provided by the module config"
}

data "aws_secretsmanager_secret" "test_db_master_password" {
  name = "terraform/db/test-db/root-password"
}

data "aws_secretsmanager_secret_version" "test_db_master_password" {
  secret_id = "${data.aws_secretsmanager_secret.test_db_master_password.id}"
}

data "aws_iam_role" "rds-monitoring-role" {
  name = "rds-monitoring-role"
}

resource "aws_rds_cluster" "test_db" {
  cluster_identifier = "${var.test_db_name}"
  engine             = "aurora-mysql"
  engine_version     = "5.7.12"

  # can only request to deploy in AZ's where there is a subnet in the subnet group.
  availability_zones              = "${slice(var.aws_availability_zones, 0, var.test_db_instance_count)}"
  database_name                   = "${var.test_db_schema_name}"
  master_username                 = "root"
  master_password                 = "${data.aws_secretsmanager_secret_version.test_db_master_password.secret_string}"
  preferred_backup_window         = "${var.test_db_backup_window}"
  preferred_maintenance_window    = "${var.test_db_maintenance_window}"
  backup_retention_period         = "${var.test_db_backup_retention_period}"
  db_subnet_group_name            = "${aws_db_subnet_group.test_db.name}"
  storage_encrypted               = true
  kms_key_id                      = "${data.aws_kms_key.kms_rds_key.arn}"
  deletion_protection             = true
  enabled_cloudwatch_logs_exports = ["audit", "error", "general", "slowquery"]
  vpc_security_group_ids          = ["${aws_security_group.test_db.id}"]
  final_snapshot_identifier       = "test-db-final-snapshot"
}

variable "test_db_instance_count" {
  description = "Number of instances to create, is provided by the module config"
}

resource "aws_rds_cluster_instance" "test_db" {
  count                = "${var.test_db_instance_count}"
  identifier           = "${var.test_db_name}"
  cluster_identifier   = "${aws_rds_cluster.test_db.id}"
  availability_zone    = "${element(var.aws_availability_zones, count.index)}"
  instance_class       = "db.t2.small"
  db_subnet_group_name = "${aws_db_subnet_group.test_db.name}"
  monitoring_interval  = 60
  engine               = "aurora-mysql"
  engine_version       = "5.7.12"
  monitoring_role_arn  = "${data.aws_iam_role.rds-monitoring-role.arn}"

  tags {
    Name = "test_db-${count.index}"
  }
}

My question is: is there a way to achieve this so that Terraform does not try to recreate the resources, e.g. by ensuring that the availability zones of the cluster and the ID of the instance do not change each time we run Terraform?

Have you looked into Terraform's Workspaces feature (terraform.io/docs/state/workspaces.html) to help create different state files for each staging/prod environment? It seems like each run of terraform plan is trying to overwrite or recreate infrastructure from a different environment. - Adil B
It would be helpful if you could show/explain how you are configuring your state and what your workflow looks like here. As alluded to in @JamesWoolfenden's answer this might be a state file issue rather than anything else but it's hard to tell without you showing us what you're doing with your state. - ydaetskcoR
I have added my state management content to the OP. No other resources in our Terraform files have experienced this issue (we've been happily using this setup to manage SQS, SNS, S3 and IAM users for months, and didn't come across this until we started using RDS with a dynamic number of subnets and DB instances per cluster). - David
@AdilB we did indeed investigate workspaces and started using them. However, after being involved in the discussions at github.com/hashicorp/terraform/issues/13700 and github.com/hashicorp/terraform/issues/18632, it became apparent to us that using workspaces for this type of isolation is perhaps not advisable, mainly because of the part that mentions different access control. - David

David · Accepted Answer · 2019-01-28T08:14:31

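It turns out that removing the explicit availability zone definitions from both aws_rds_cluster and aws_rds_cluster_instance makes this issue go away, and so far everything appears to work as expected. This matches the plan output in the question: the existing cluster reports three availability zones, while the configuration requests only one, and that difference forces a new resource. See also https://github.com/terraform-providers/terraform-provider-aws/issues/7307#issuecomment-457441633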
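For illustration, here is a minimal sketch of the two resources from the question with the explicit AZ arguments dropped. It uses the same Terraform 0.11 syntax and the same variable and resource names as the question. Only a few arguments are shown; the rest are assumed to stay exactly as in the original configuration.

```hcl
# Sketch only: same resources as in the question, minus the AZ arguments.
resource "aws_rds_cluster" "test_db" {
  cluster_identifier   = "${var.test_db_name}"
  engine               = "aurora-mysql"
  engine_version       = "5.7.12"
  db_subnet_group_name = "${aws_db_subnet_group.test_db.name}"

  # availability_zones has been removed. The cluster is still limited to the
  # AZs covered by the subnets in the DB subnet group.

  # ... remaining arguments unchanged from the question ...
}

resource "aws_rds_cluster_instance" "test_db" {
  count                = "${var.test_db_instance_count}"
  cluster_identifier   = "${aws_rds_cluster.test_db.id}"
  instance_class       = "db.t2.small"
  db_subnet_group_name = "${aws_db_subnet_group.test_db.name}"
  engine               = "aurora-mysql"
  engine_version       = "5.7.12"

  # availability_zone has been removed, so placement is left to RDS.

  # ... remaining arguments unchanged from the question ...
}
```

With the arguments gone, the configuration no longer asks for fewer AZs than the cluster already reports, so a repeated terraform plan should show no changes for these resources. If the AZ arguments have to stay, an alternative that I have not tested here would be a lifecycle block with ignore_changes = ["availability_zones"] on the cluster.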