2
votes

We've recently been experiencing unexplained latency issues in our AWS setup, as reflected in the ELB Latency metric.

Our setup consists of 3 EC2 c1.medium machines (each running NGINX, which talks to a uWSGI handler on the same machine) behind an ELB.

Now, our traffic peaks in the morning and evening, but that doesn't explain what we're seeing, i.e. latency spikes of 10 seconds well into the traffic peak.
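For reference, this is roughly how one can pull that metric out of CloudWatch to line it up against the traffic pattern (a boto3 sketch; the ELB name and region below are placeholders):

import boto3
from datetime import datetime, timedelta, timezone

# Placeholders: our actual ELB name and region differ.
ELB_NAME = "our-elb"
REGION = "us-east-1"

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# Average and maximum ELB latency in 5-minute buckets for the last day.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": ELB_NAME}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    # Latency is reported in seconds.
    print(point["Timestamp"], round(point["Average"], 3), round(point["Maximum"], 3))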

Our NGINX logs and uWSGI stats show that we are not queuing any requests and that response times are solidly under 500 ms.
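For the uWSGI side, something like the following against the uWSGI stats server shows the listen queue and per-worker response times (a sketch; it assumes uWSGI is started with --stats 127.0.0.1:9191, and that address is just a placeholder):

import json
import socket

# Assumes uWSGI runs with something like "--stats 127.0.0.1:9191"; the stats
# server sends one JSON document per connection and then closes it.
STATS_ADDR = ("127.0.0.1", 9191)

def read_uwsgi_stats(addr=STATS_ADDR):
    chunks = []
    with socket.create_connection(addr) as conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks))

stats = read_uwsgi_stats()
# A non-zero listen_queue would mean requests are piling up in front of the workers.
print("listen queue:", stats.get("listen_queue"))
for worker in stats.get("workers", []):
    # avg_rt is the per-worker average response time in microseconds.
    print("worker", worker["id"], "avg response ms:", worker.get("avg_rt", 0) / 1000.0)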

Some config details:

The ELB listens on port 8443 and forwards to port 8080.

NGINX has the following config on each EC2 instance:

# two workers: one per core on a c1.medium
worker_processes 2;
pid /var/run/nginx.pid;

events {
    # up to 4000 simultaneous connections per worker
    worker_connections 4000;
    # accept as many new connections as possible at once
    multi_accept on;
    # use the Linux epoll event method
    use epoll;
}

http {
    server {
        # close timed-out client connections with a reset to free memory right away
        reset_timedout_connection on;
        access_log off;
        # port the ELB forwards to
        listen 8080;

        location / {
            # hand everything to the local uWSGI instance over the uwsgi protocol
            include uwsgi_params;
            uwsgi_pass 127.0.0.1:3031;
        }
    }
}

I was wondering if anyone has experienced something similar or can perhaps offer an explanation.

Thank you.

Did you ever find the reason for this? I am experiencing exactly the same issue at the moment. – Alistair Prestidge
To our understanding, it seems that the ELB's latency data includes the time until the connection is closed on the client side. So when a client is on a bad or loaded network, it skews the whole ELB latency report: we could have finished processing the request in 300 ms, but it took another 700 ms to transmit over, say, a 3G network. That fits what we're seeing from our clients. Does that make sense? – wilfo
I'm seeing a similar issue, but the "client" in this case is my own benchmark from within the same AWS region, so I don't think it's client network issues. – Dustin Boswell
In my case, I was originally using uwsgi's http server. When I switched to nginx in front of uwsgi's non-http server, the latency spikes went away. My nginx conf was slightly different (no multi_accept, epoll, or reset_timedout_connection) -- not sure if that matters. – Dustin Boswell
Thanks for sharing. We don't see this anymore; it was sort of a passing phase. We also heard the suggestion that clients on poor cellular connections are at fault and that the latency occurs on their end. But who knows. – wilfo

1 Answer

2
votes

I'm not sure whether this is documented anywhere, but we've been using ELBs for quite a while. In essence, ELBs are EC2 instances sitting in front of the instances you are load balancing, and it's our understanding that when your ELB starts experiencing more traffic, Amazon does some magic to move that ELB instance from, say, a c1.medium to an m1.xlarge.

So it could be that when you start to see peaks, Amazon is transitioning from the smaller to the larger ELB instance, and you are seeing those delays.

Again, customers don't know what goes on inside Amazon, so for all you know they could be experiencing heavy traffic at the same time as your peaks and their load balancers are going berserk.

You could probably avoid these delays by over-provisioning, but who wants to spend more money?

There are a couple of things I would recommend if you have time and resources:

  1. Set up an HAProxy instance (on some large instance type) in front of your environment and monitor your traffic that way. HAProxy has a command-line (or web) stats interface that lets you see what's going on; a rough sketch of pulling those stats follows after this list. Of course, you also need to monitor the instance itself for things like CPU and memory.

  2. You may not be able to do this in production, in which case you will have to run test traffic through it. I recommend using something like loader.io. Another option is to send part of your traffic to the HAProxy instance, perhaps using GSLB (if your DNS provider supports it).
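
For the stats part of point 1, here is a rough sketch of pulling numbers from HAProxy's admin socket with Python; it assumes your haproxy.cfg has something like "stats socket /var/run/haproxy.sock level admin" in its global section (the socket path is a placeholder):

import csv
import io
import socket

# Assumed socket path; adjust to whatever your haproxy.cfg declares.
SOCKET_PATH = "/var/run/haproxy.sock"

def show_stat(path=SOCKET_PATH):
    # Send "show stat" to the HAProxy admin socket and parse the CSV reply.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(b"show stat\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # The reply is CSV whose header line starts with "# ".
    raw = b"".join(chunks).decode()
    return list(csv.DictReader(io.StringIO(raw.lstrip("# "))))

for row in show_stat():
    # scur = current sessions, qcur = currently queued requests per proxy/server.
    print(row["pxname"], row["svname"], "sessions:", row["scur"], "queued:", row["qcur"])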