I'm running a Rails 3.2.21 app and deploy to a Ubuntu 12.04.5 box using capistrano (nginx and unicorn).
I have my app set for a zero-downtime deploy (at least I thought), with my config files looking more or less like these.
Here's the problem: When the deploy is nearly done and it restarts unicorn, when I watch my unicorn.log I see it fire up the new workers, reap the old ones... but then my app just hangs for 2-3 minutes. Any request to the app at this point hits the timeout window (which I set to 40 seconds) and returns my app's 500 error page.
Here is the first part of the output from unicorn.log as unicorn is restarting (I have 5 unicorn workers):
I, [2015-04-21T23:06:57.022492 #14347] INFO -- : master process ready
I, [2015-04-21T23:06:57.844273 #15378] INFO -- : worker=0 ready
I, [2015-04-21T23:06:57.944080 #15381] INFO -- : worker=1 ready
I, [2015-04-21T23:06:58.089655 #15390] INFO -- : worker=2 ready
I, [2015-04-21T23:06:58.230554 #14541] INFO -- : reaped #<Process::Status: pid 15551 exit 0> worker=4
I, [2015-04-21T23:06:58.231455 #14541] INFO -- : reaped #<Process::Status: pid 3644 exit 0> worker=0
I, [2015-04-21T23:06:58.249110 #15393] INFO -- : worker=3 ready
I, [2015-04-21T23:06:58.650007 #15396] INFO -- : worker=4 ready
I, [2015-04-21T23:07:01.246981 #14541] INFO -- : reaped #<Process::Status: pid 32645 exit 0> worker=1
I, [2015-04-21T23:07:01.561786 #14541] INFO -- : reaped #<Process::Status: pid 15534 exit 0> worker=2
I, [2015-04-21T23:07:06.657913 #14541] INFO -- : reaped #<Process::Status: pid 16821 exit 0> worker=3
I, [2015-04-21T23:07:06.658325 #14541] INFO -- : master complete
Afterwards, as the app hangs for those 2-3 minutes, here is what's happening:
E, [2015-04-21T23:07:38.069635 #14347] ERROR -- : worker=0 PID:15378 timeout (41s > 40s), killing
E, [2015-04-21T23:07:38.243005 #14347] ERROR -- : reaped #<Process::Status: pid 15378 SIGKILL (signal 9)> worker=0
E, [2015-04-21T23:07:39.647717 #14347] ERROR -- : worker=3 PID:15393 timeout (41s > 40s), killing
E, [2015-04-21T23:07:39.890543 #14347] ERROR -- : reaped #<Process::Status: pid 15393 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:07:40.727755 #16002] INFO -- : worker=0 ready
I, [2015-04-21T23:07:43.212395 #16022] INFO -- : worker=3 ready
E, [2015-04-21T23:08:24.511967 #14347] ERROR -- : worker=3 PID:16022 timeout (41s > 40s), killing
E, [2015-04-21T23:08:24.718512 #14347] ERROR -- : reaped #<Process::Status: pid 16022 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:08:28.010429 #16234] INFO -- : worker=3 ready
Eventually, after 2 or 3 minutes, the app starts being responsive again, but everything is more sluggish. You can see this very clearly in New Relic (the horizontal line marks the deploy, and the light blue area indicates Ruby):
I have an identical staging server, and I cannot replicate the issue in staging... granted, staging is under no load (I'm the only person trying to make page requests).
Here is my config/unicorn.rb file:
root = "/home/deployer/apps/myawesomeapp/current"
working_directory root
pid "#{root}/tmp/pids/unicorn.pid"
stderr_path "#{root}/log/unicorn.log"
stdout_path "#{root}/log/unicorn.log"
shared_path = "/home/deployer/apps/myawesomeapp/shared"
listen "/tmp/unicorn.myawesomeapp.sock"
worker_processes 5
timeout 40
preload_app true
before_exec do |server|
ENV['BUNDLE_GEMFILE'] = "#{root}/Gemfile"
end
before_fork do |server, worker|
if defined?(ActiveRecord::Base)
ActiveRecord::Base.connection.disconnect!
end
old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin"
if File.exists?(old_pid) && server.pid != old_pid
begin
Process.kill("QUIT", File.read(old_pid).to_i)
rescue Errno::ENOENT, Errno::ESRCH
end
end
end
after_fork do |server, worker|
if defined?(ActiveRecord::Base)
ActiveRecord::Base.establish_connection
end
end
And just to paint a complete picture, in my capistrano deploy.rb, the unicorn restart task looks like this:
namespace :deploy do
task :restart, roles: :app, except: { no_release: true } do
run "kill -s USR2 `cat #{release_path}/tmp/pids/unicorn.pid`"
end
end
Any ideas why the unicorn workers timeout right after the deploy? I thought the point of a zero-downtime was to keep the old ones around until the new ones are spun up and ready to serve?
Thanks!
UPDATE
I did another deploy, and this time kept an eye on production.log to see what was going on there. The only suspicious thing was the following lines, which were mixed in with normal requests:
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
UPDATE #2
As suggested by some of the answers below, I changed the before_fork
block to add sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
so the workers would be incrementally killed off.
Same result, terribly slow deploy, with the same spike I illustrated in the graph above. Just for context, out of my 5 worker processes, the first 4 sent a TTOU signal, and the 5th sent QUIT. Still, does not seem to have made a difference.
executing "cd -- /home/deployer/apps/myawesomeapp/releases/20150422035053 && bundle exec rake RAILS_ENV=production RAILS_GROUPS=assets assets:precompile"
which is a command that takes around 30 seconds to finish. – DelPiero