The problem:
I'm trying to build a Docker Swarm cluster on DigitalOcean, consisting of 3 "manager" nodes and some number of worker nodes (the exact number isn't relevant to this question). I'm trying to modularize the Docker Swarm provisioning so that it's not coupled to the digitalocean provider, but can instead receive a list of IP addresses to provision the cluster against.
In order to provision the master nodes, the first node needs to be put into swarm mode, which generates a join token that the other master nodes will use to join the first one. I'm using "null_resource"s to execute remote provisioners against the master nodes. However, I cannot figure out how to make sure the first master node finishes its work ("docker swarm init ...") before another "null_resource" provisioner executes against the other master nodes that need to join the first one. They all run in parallel and, predictably, it doesn't work.
Further, I'm trying to figure out how to collect the first node's generated join token and make it available to the other nodes. I've considered doing this with Consul: storing the join token as a key and reading that key on the other nodes. This isn't ideal, though, as there are still issues with ensuring the Consul cluster is provisioned and ready (so it's kind of the same problem).
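To make the token hand-off concrete, here is a rough sketch of one possible direction, using Terraform's "external" data source instead of Consul. It is not a tested solution. The helper script scripts/fetch-join-token.sh is hypothetical and does not exist in this setup, and the data source name and the "token" key are made up for illustration. The syntax follows the 0.11-style interpolation used in the rest of this post, and it reuses the resource and variable names from docker/master/main.tf.

```hcl
# Sketch only. Assumes a hypothetical helper script that:
#   1. reads a JSON object like {"host": "<ip>"} on stdin,
#   2. SSHes to that host and runs `docker swarm join-token manager -q`,
#   3. prints a JSON object like {"token": "<value>"} on stdout.
data "external" "manager_join_token" {
  # Intended to delay the lookup until the first manager has run
  # `docker swarm init` (assumes depends_on is honoured for data sources).
  depends_on = ["null_resource.swarm_master"]

  program = ["bash", "${path.module}/scripts/fetch-join-token.sh"]

  query = {
    host = "${element(var.public_ip, 0)}"
  }
}

# The result is a map of strings, so the token could be interpolated
# elsewhere, for example in an output or a provisioner command.
output "manager_join_token" {
  value = "${data.external.manager_join_token.result["token"]}"
}
```

The appeal of this shape is that the token becomes an ordinary attribute that other resources can interpolate, which also gives Terraform a dependency edge to order things by. The cost is that the token ends up in the state file.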
main.tf
variable "master_count" { default = 3 }

# master nodes
resource "digitalocean_droplet" "master_nodes" {
  count = "${var.master_count}"
  ... etc, etc
}

module "docker_master" {
  source     = "./docker/master"
  private_ip = "${digitalocean_droplet.master_nodes.*.ipv4_address_private}"
  public_ip  = "${digitalocean_droplet.master_nodes.*.ipv4_address}"
  instances  = "${var.master_count}"
}
docker/master/main.tf
variable "instances" {}
variable "private_ip" { type = "list" }
variable "public_ip" { type = "list" }

# Act only on the first item in the list of masters...
resource "null_resource" "swarm_master" {
  count = 1

  # Just to ensure this gets run every time
  triggers {
    version = "${timestamp()}"
  }

  connection {
    ...
    host = "${element(var.public_ip, 0)}"
  }

  provisioner "remote-exec" {
    inline = [<<EOF
... install docker, then ...
docker swarm init --advertise-addr ${element(var.private_ip, 0)}
MANAGER_JOIN_TOKEN=$(docker swarm join-token manager -q)
# need to do something with the join token, like make it available
# as an attribute for interpolation in the next "null_resource" block
EOF
    ]
  }
}

# Act on the other 2 swarm master nodes (*not* the first one)
resource "null_resource" "other_swarm_masters" {
  count = "${var.instances - 1}"

  triggers {
    version = "${timestamp()}"
  }

  # Host key slices the 3-element IP list and excludes the first one
  connection {
    ...
    host = "${element(slice(var.public_ip, 1, length(var.public_ip)), count.index)}"
  }

  provisioner "remote-exec" {
    inline = [<<EOF
SWARM_MASTER_JOIN_TOKEN=$(consul kv get docker/swarm/manager/join_token)
docker swarm join --token ??? ${element(var.private_ip, 0)}:2377
EOF
    ]
  }

  ##### THIS IS THE MEAT OF THE QUESTION ###
  # How do I make this "null_resource" block not run until the other one has
  # completed and generated the swarm token output? depends_on doesn't
  # seem to do it :(
}
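For illustration, the ordering being asked for could be sketched as a variant of the second block, shown below. This is a sketch under assumptions, not a confirmed fix. It pairs an explicit depends_on with a trigger that interpolates the first resource's id, since referencing another resource's attribute is the usual way Terraform infers ordering. The trigger key first_master_id and the echo command are placeholders invented for this example. The connection settings elided in the original are left out rather than guessed.

```hcl
# Sketch only: a variant of "other_swarm_masters" in 0.11-style syntax.
resource "null_resource" "other_swarm_masters" {
  count = "${var.instances - 1}"

  # Explicit ordering: a list of quoted resource addresses in 0.11-style syntax.
  depends_on = ["null_resource.swarm_master"]

  triggers {
    version = "${timestamp()}"

    # Interpolating an attribute of the first resource adds an implicit
    # dependency on it; join() flattens the splat of its single instance.
    first_master_id = "${join(",", null_resource.swarm_master.*.id)}"
  }

  connection {
    # Only the host is set here; other connection settings are not guessed.
    host = "${element(slice(var.public_ip, 1, length(var.public_ip)), count.index)}"
  }

  # Placeholder command: the real `docker swarm join ...` would go here.
  provisioner "remote-exec" {
    inline = [
      "echo 'first manager initialised, ready to join'",
    ]
  }
}
```

Because a null_resource is only marked created after its provisioners finish, any resource that depends on "swarm_master" should not start until "docker swarm init" has returned. That assumes the dependency edge is actually present in the graph.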
From reading through GitHub issues, I get the feeling this isn't an uncommon problem... but it's kicking my ass. Any suggestions appreciated!