2 votes

I have a Java app with auto-scaling on the App Engine Standard Environment. Right now the scaling is configured like this:

<instance-class>F2</instance-class>

<automatic-scaling>
    <min-idle-instances>1</min-idle-instances>

    <!-- ‘automatic’ is the default value. -->
    <max-idle-instances>2</max-idle-instances>

    <!-- ‘automatic’ is the default value. -->
    <min-pending-latency>2000ms</min-pending-latency>

    <max-pending-latency>8000ms</max-pending-latency>
    <max-concurrent-requests>60</max-concurrent-requests>
</automatic-scaling>

I just started trying the F2 instance class; I was using F1 instances earlier. No matter how I configure my auto-scaling, it seems like the newly created instance (created when load increases) starts getting all the incoming requests, while the resident instance sits with a very light load.

Why is this? Of course, I am unable to monitor the traffic (and which instance it goes to) in real time, but every time I look the story seems to be the same. I have included a few sample screenshots below.

[screenshot: instance list; the newly created dynamic instance is receiving nearly all requests]

In the following case (this was a slightly different configuration from the one above), three instances are sitting idle, but GAE's load balancer chooses to send all requests to the instance with the highest latency!

[screenshot: instance list; three idle instances, with all requests routed to the highest-latency one]

One more example: this is the request log for the resident instance started at 10:15:45 AM today:

[screenshot: request log of the resident instance]

and the request log for the dynamic instance that started 10 seconds later:

[screenshot: request log of the dynamic instance]

As you can see, the dynamic instance is handling all of the requests (1889 so far) while the resident instance sits essentially idle (7 in the same time period). This would still be OK if not for the fact that the resident instances seem to be destroyed and created anew right around the time new dynamic instances are being created, which means that for a minute or so all requests see 10-20 second response times.

Can somebody please explain how to configure this?

Here's what I want:

  • One idle instance should be able to handle the load most of the time (for now).
  • When more requests come in, spin up an additional instance. When it is ready, start diverting traffic to it.

I am trying to run a reasonable-load site on a shoestring budget, so it is important that I try to stay as close to the free quota as possible.

Update 1

Since both the answers talk about the warmup request prominently, I thought I'd list details about it here. I am using a ServletContextListener to handle the initialization (sketched after the list below). It does the following (times are gathered using Guava's Stopwatch class and cover only the code I have written/am explicitly invoking):

  1. Register Objectify entities (1.449 s)
  2. Freemarker init (229 ms)
  3. Firebase init (228.2 ms)
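For reference, here is a minimal sketch of that listener. The entity class and the two init methods are placeholders standing in for my real code; the timings in the comments are the ones measured above:

import com.google.common.base.Stopwatch;
import com.googlecode.objectify.ObjectifyService;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import java.util.logging.Logger;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

public class StartupListener implements ServletContextListener {
    private static final Logger log = Logger.getLogger(StartupListener.class.getName());

    // Placeholder: stands in for the app's real Objectify entities.
    @Entity
    private static class SampleEntity { @Id Long id; }

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        Stopwatch watch = Stopwatch.createStarted();
        ObjectifyService.register(SampleEntity.class); // step 1: ~1.449 s in total
        log.info("Objectify registration: " + watch);

        watch.reset().start();
        initFreemarker();                              // step 2: ~229 ms
        log.info("Freemarker init: " + watch);

        watch.reset().start();
        initFirebase();                                // step 3: ~228.2 ms
        log.info("Firebase init: " + watch);
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) { }

    // Details elided; each method just wraps the corresponding library's setup call.
    private void initFreemarker() { /* freemarker.template.Configuration setup */ }
    private void initFirebase()   { /* FirebaseApp.initializeApp(...) */ }
}

Guava's Stopwatch.toString() prints a human-readable elapsed time, which is where the numbers above come from.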

Other than that, I have the Shiro filter, the Objectify filter, and the Jersey filter configured in my web.xml. In Jersey I am avoiding classpath scanning (I think) by explicitly registering the classes rather than giving it a package to scan. I am not using any dependency injection, again to avoid classpath scanning.
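Concretely, the explicit registration looks roughly like this (a sketch assuming Jersey 2's ResourceConfig; PingResource is a hypothetical stand-in for my real endpoint classes):

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import org.glassfish.jersey.server.ResourceConfig;

// Hypothetical endpoint, standing in for the app's real resource classes.
@Path("/ping")
class PingResource {
    @GET
    public String ping() { return "pong"; }
}

// Each resource class is registered explicitly, so Jersey never has to
// scan packages on the classpath during instance startup.
public class AppConfig extends ResourceConfig {
    public AppConfig() {
        register(PingResource.class);
        // register(OtherResource.class); // ...one line per endpoint class
    }
}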

The /_ah/warmup request took 7.8 s (that's the request from which the above times are taken). But requests served by a freshly started dynamic instance whose warmup has already finished are taking 10+ seconds to complete, despite the fact that these same calls take 200-700 ms two minutes later. So what else is going on in the background, other than the stuff I am explicitly doing in my StartupListener?

Here's part 1 of the log and here's part 2 of the log.


2 Answers

3 votes

it seems like the newly-created instance (created when load increases) starts getting all the incoming requests, while the resident instance sits with a very light load.

My mental model is that resident instances and warmup requests are only useful when the boot time of your GAE instance is large. (I'm not sure if that's the intent, but that's the behavior I've observed.)

Namely, traffic is sent to resident instances while the new instances are being booted (and other dynamic instances can't handle it). Once the new instance is up and running, traffic gets routed to it, instead of the resident instance.

This means that if your instance boot time is low, the resident instances won't be doing much work. An F2 can boot up in ~250 ms (by my testing), so if your average response latency is 2000 ms, the new dynamic instance will have booted completely before the resident instance finishes handling the request. As such, it'll be ready to handle subsequent requests instead of the resident one.

This appears to follow the behavior pattern you're seeing.

You might be able to confirm this by looking at how Stackdriver and the logs separate out your response time vs. your boot time. If the boot time is really small, then resident instances might not help you much.

but GAE's load balancer chooses to send all requests to the instance with the highest latency!

Sadly, there's not much info around how GAE decides which instance to send new requests to. All I've found is How instances are managed and scheduling settings, which talk more about the parameters governing when to boot new instances.

I know it's not the question you asked, but the 2000 ms latency might be contributing to the issue here. If your min-pending-latency is set to 2000 ms, then new requests will sit in the pending queue for 2000 ms before a new instance is spawned. But if requests are being serviced in a serial fashion (threadsafe off), then requests whose pending time lands between 1500 and 2000 ms would still be serviced without triggering a new instance.

I would suggest turning on threadsafe to see if that helps, and also adding some custom tracing in case the code is doing something odd that you don't have visibility into.
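For a Java app that's a one-line change in appengine-web.xml (a sketch; note that with threadsafe on, your servlets and filters must actually be safe for concurrent requests):

<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
    <instance-class>F2</instance-class>
    <!-- Let a single instance serve requests concurrently instead of serially. -->
    <threadsafe>true</threadsafe>
    <!-- ... automatic-scaling block as in the question ... -->
</appengine-web-app>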

2 votes

The role of the idle instance(s) is not to handle the usual traffic, but to be able to handle overflows - temporary peaks of traffic that the already running dynamic instances (if any) can't handle, until new instances are being started.

In a sense, that's why they're called idle: most of the time they're just sitting idle. IMHO, dropping idle instances is one of the first things to do when under budget pressure.

Also maybe relevant: In Google app engine only one instance handling most of requests

Side note: it's not that GAE's load balancer "chooses to send all requests to the instance with the highest latency". Actually, the latency on that preferred instance is the highest because it's the one getting the majority of the traffic.

To prevent GAE from sending traffic to new instances before they're ready to handle it, you need to configure (and properly handle) warmup requests:

Warmup requests are a specific type of loading request that load application code into an instance ahead of time, before any live requests are made. To learn more about how to use warmup requests, see Warmup Requests. Manual or basic scaling instances do not receive an /_ah/warmup request.
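For a Java app, a minimal setup (a sketch; WarmupServlet is a hypothetical class name, and the initialization can equally live in a ServletContextListener as in your update) is to enable warmup requests in appengine-web.xml and map a handler to /_ah/warmup in web.xml:

<!-- appengine-web.xml -->
<warmup-requests-enabled>true</warmup-requests-enabled>

<!-- web.xml -->
<servlet>
    <servlet-name>warmup</servlet-name>
    <!-- Hypothetical servlet that triggers the same initialization as the listener. -->
    <servlet-class>com.example.WarmupServlet</servlet-class>
</servlet>
<servlet-mapping>
    <servlet-name>warmup</servlet-name>
    <url-pattern>/_ah/warmup</url-pattern>
</servlet-mapping>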

You may also want to give this answer a thought: How do you keep a running instance for Google App Engine. Basically, try to keep the dynamic instances running indefinitely by preventing them from remaining idle for too long, via a cron job.
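With the Java runtime that could be a cron.xml next to web.xml (a sketch; the /keepalive URL and the 5-minute interval are placeholders you'd tune against the scheduler's idle timeout, and the handler should be a cheap no-op endpoint):

<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
    <cron>
        <url>/keepalive</url>
        <description>Ping the app so dynamic instances never sit idle long enough to be shut down</description>
        <schedule>every 5 minutes</schedule>
    </cron>
</cronentries>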

As for the apparent restarts of the resident instance right after new dynamic instances are started: that does seem a bit odd, but I wouldn't worry too much. It may simply be some sort of play-it-safe refresh strategy: that would be the moment when the need for the idle instance is lowest, since the freshly started dynamic instance is the least likely to be overwhelmed by incoming requests.