1
votes

I have 2 EKS clusters in 2 different AWS accounts and, I assume, behind different firewalls (which I don't have access to). The first one (Dev) is fine; however, with the same configuration, pods in the UAT cluster are struggling to resolve DNS. The nodes themselves can resolve DNS and seem to be fine.

1) ping 8.8.8.8 works

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms

2) I can ping the IP of Google (and others), but not the actual DNS names.

Our configuration:

  1. Configured with Terraform.
  2. The worker nodes and control plane SGs are the same as the Dev ones. I believe those are fine.
  3. Added 53 TCP and 53 UDP on the inbound + outbound NACL (just to be sure 53 was really open...). Added 53 TCP and 53 UDP outbound from the worker nodes.
  4. We are using ami-059c6874350e63ca9 with Kubernetes version 1.14.

I am unsure whether the problem is a firewall somewhere, CoreDNS, my configuration that needs to be updated, or a "stupid mistake". Any help would be appreciated.

2
There are a lot of variables in your case; do you mind sharing your Terraform script? Don't forget to remove sensitive data. We also need to read the YAMLs of your services; if you don't have them, please run kubectl get services -o yaml to export them and paste the output in your question. - Will R.O.F.

2 Answers

1
votes

Note that this issue may present itself in many forms (DNS not resolving is just one possible case). The terraform-aws-eks module exposes a Terraform input that creates the security group rules needed for communication between worker groups/node groups: worker_create_cluster_primary_security_group_rules. More information is in this terraform-aws-eks issue: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1089
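As a minimal sketch of how that input might be switched on, assuming a module release that still ships it and assuming the other arguments shown (cluster_name, vpc_id, subnets) match your module version. All values below are hypothetical placeholders, not taken from the question:

```hcl
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  # Hypothetical placeholder values; keep whatever your existing
  # module block already sets for these arguments.
  cluster_name = "uat-cluster"
  vpc_id       = "vpc-xxxxxxxx"
  subnets      = ["subnet-aaaaaaaa", "subnet-bbbbbbbb"]

  # The input named above: asks the module to create the security group
  # rules between the workers and the cluster primary security group.
  worker_create_cluster_primary_security_group_rules = true
}
```

Running terraform plan afterwards should list the new security group rules before anything is applied.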

When the input is enabled, Terraform plans the following security group rules:

  # module.eks.module.eks.aws_security_group_rule.cluster_primary_ingress_workers[0] will be created
  + resource "aws_security_group_rule" "cluster_primary_ingress_workers" {
      + description              = "Allow pods running on workers to send communication to cluster primary security group (e.g. Fargate pods)."
      + from_port                = 0
      + id                       = (known after apply)
      + protocol                 = "-1"
      + security_group_id        = "sg-03bb33d3318e4aa03"
      + self                     = false
      + source_security_group_id = "sg-0fffc4d49a499a1d8"
      + to_port                  = 65535
      + type                     = "ingress"
    }

  # module.eks.module.eks.aws_security_group_rule.workers_ingress_cluster_primary[0] will be created
  + resource "aws_security_group_rule" "workers_ingress_cluster_primary" {
      + description              = "Allow pods running on workers to receive communication from cluster primary security group (e.g. Fargate pods)."
      + from_port                = 0
      + id                       = (known after apply)
      + protocol                 = "-1"
      + security_group_id        = "sg-0fffc4d49a499a1d8"
      + self                     = false
      + source_security_group_id = "sg-03bb33d3318e4aa03"
      + to_port                  = 65535
      + type                     = "ingress"
    }
0
votes

After days of debugging, here is what the problem was: I had allowed "all traffic" between the nodes, but that rule only covered TCP, not UDP, and DNS queries normally go over UDP port 53.

It was basically a one-line fix in AWS: in the worker nodes SG, add an inbound rule whose source is the worker nodes SG itself, port 53, protocol UDP (DNS).

If you use Terraform, it should look like this:

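# Allow DNS queries (UDP port 53) between worker nodes, so that pods
# can reach the DNS pods running on other nodes.
resource "aws_security_group_rule" "eks-node-ingress-cluster-dns" {
  description              = "Allow pods DNS"
  from_port                = 53
  protocol                 = "udp" # same as protocol number 17
  security_group_id        = "${aws_security_group.SG-eks-WorkerNodes.id}"
  source_security_group_id = "${aws_security_group.SG-eks-WorkerNodes.id}"
  to_port                  = 53
  type                     = "ingress"
}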
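
If the intent was really "all traffic between the nodes", a broader rule is an alternative. The following is only a sketch under that assumption: it reuses the SG-eks-WorkerNodes security group from the rule above, the resource name eks-node-ingress-self-all is made up for illustration, and protocol "-1" means every protocol (TCP, UDP, ICMP, ...), which may be more open than your security policy allows:

```hcl
# Hypothetical alternative: let worker nodes reach each other on every
# protocol and port, instead of opening UDP 53 only.
resource "aws_security_group_rule" "eks-node-ingress-self-all" {
  description       = "Allow all traffic between worker nodes"
  from_port         = 0
  to_port           = 65535
  protocol          = "-1" # all protocols, not just TCP
  security_group_id = "${aws_security_group.SG-eks-WorkerNodes.id}"
  self              = true # source is this same security group
  type              = "ingress"
}
```

With a rule like this in place, the DNS-specific rule becomes redundant, so pick one approach rather than both.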