Unicorn workers timeout after “zero downtime” deploy with capistrano

Question

I'm running a Rails 3.2.21 app and deploy to a Ubuntu 12.04.5 box using capistrano (nginx and unicorn).

I have my app set for a zero-downtime deploy (at least I thought), with my config files looking more or less like these.

Here's the problem: When the deploy is nearly done and it restarts unicorn, when I watch my unicorn.log I see it fire up the new workers, reap the old ones... but then my app just hangs for 2-3 minutes. Any request to the app at this point hits the timeout window (which I set to 40 seconds) and returns my app's 500 error page.

Here is the first part of the output from unicorn.log as unicorn is restarting (I have 5 unicorn workers):

I, [2015-04-21T23:06:57.022492 #14347]  INFO -- : master process ready
I, [2015-04-21T23:06:57.844273 #15378]  INFO -- : worker=0 ready
I, [2015-04-21T23:06:57.944080 #15381]  INFO -- : worker=1 ready
I, [2015-04-21T23:06:58.089655 #15390]  INFO -- : worker=2 ready
I, [2015-04-21T23:06:58.230554 #14541]  INFO -- : reaped #<Process::Status: pid 15551 exit 0> worker=4
I, [2015-04-21T23:06:58.231455 #14541]  INFO -- : reaped #<Process::Status: pid 3644 exit 0> worker=0
I, [2015-04-21T23:06:58.249110 #15393]  INFO -- : worker=3 ready
I, [2015-04-21T23:06:58.650007 #15396]  INFO -- : worker=4 ready
I, [2015-04-21T23:07:01.246981 #14541]  INFO -- : reaped #<Process::Status: pid 32645 exit 0> worker=1
I, [2015-04-21T23:07:01.561786 #14541]  INFO -- : reaped #<Process::Status: pid 15534 exit 0> worker=2
I, [2015-04-21T23:07:06.657913 #14541]  INFO -- : reaped #<Process::Status: pid 16821 exit 0> worker=3
I, [2015-04-21T23:07:06.658325 #14541]  INFO -- : master complete

Afterwards, as the app hangs for those 2-3 minutes, here is what's happening:

E, [2015-04-21T23:07:38.069635 #14347] ERROR -- : worker=0 PID:15378 timeout (41s > 40s), killing
E, [2015-04-21T23:07:38.243005 #14347] ERROR -- : reaped #<Process::Status: pid 15378 SIGKILL (signal 9)> worker=0
E, [2015-04-21T23:07:39.647717 #14347] ERROR -- : worker=3 PID:15393 timeout (41s > 40s), killing
E, [2015-04-21T23:07:39.890543 #14347] ERROR -- : reaped #<Process::Status: pid 15393 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:07:40.727755 #16002]  INFO -- : worker=0 ready
I, [2015-04-21T23:07:43.212395 #16022]  INFO -- : worker=3 ready
E, [2015-04-21T23:08:24.511967 #14347] ERROR -- : worker=3 PID:16022 timeout (41s > 40s), killing
E, [2015-04-21T23:08:24.718512 #14347] ERROR -- : reaped #<Process::Status: pid 16022 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:08:28.010429 #16234]  INFO -- : worker=3 ready

Eventually, after 2 or 3 minutes, the app starts being responsive again, but everything is more sluggish. You can see this very clearly in New Relic (the horizontal line marks the deploy, and the light blue area indicates Ruby):

New Relic graph during and after deploy

I have an identical staging server, and I cannot replicate the issue in staging... granted, staging is under no load (I'm the only person trying to make page requests).

Here is my config/unicorn.rb file:

root = "/home/deployer/apps/myawesomeapp/current"
working_directory root

pid "#{root}/tmp/pids/unicorn.pid"
stderr_path "#{root}/log/unicorn.log"
stdout_path "#{root}/log/unicorn.log"

shared_path = "/home/deployer/apps/myawesomeapp/shared"

listen "/tmp/unicorn.myawesomeapp.sock"
worker_processes 5
timeout 40

preload_app true

before_exec do |server|
  ENV['BUNDLE_GEMFILE'] = "#{root}/Gemfile"
end

before_fork do |server, worker|
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
  end

  old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin"
  if File.exists?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH

    end
  end
end

after_fork do |server, worker|
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.establish_connection
  end
end

And just to paint a complete picture, in my capistrano deploy.rb, the unicorn restart task looks like this:

namespace :deploy do
  task :restart, roles: :app, except: { no_release: true } do
    run "kill -s USR2 `cat #{release_path}/tmp/pids/unicorn.pid`"
  end
end

Any ideas why the unicorn workers timeout right after the deploy? I thought the point of a zero-downtime was to keep the old ones around until the new ones are spun up and ready to serve?

Thanks!

UPDATE

I did another deploy, and this time kept an eye on production.log to see what was going on there. The only suspicious thing was the following lines, which were mixed in with normal requests:

Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de

UPDATE #2

As suggested by some of the answers below, I changed the before_fork block to add sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU so the workers would be incrementally killed off.

Same result, terribly slow deploy, with the same spike I illustrated in the graph above. Just for context, out of my 5 worker processes, the first 4 sent a TTOU signal, and the 5th sent QUIT. Still, does not seem to have made a difference.

Are you precompiling assets during the deploy or is it compiling all of them when you hit the app for the first time? — infused
@infused It appears I am precompiling them during the deploy. I see a line in the deploy output that says executing "cd -- /home/deployer/apps/myawesomeapp/releases/20150422035053 && bundle exec rake RAILS_ENV=production RAILS_GROUPS=assets assets:precompile" which is a command that takes around 30 seconds to finish. — DelPiero
Op, did you manage to solve this issue? I am encountering the same with my prodution setup — Ganesh Hegde
@GaneshHegde Yeah, I did. It was a while ago, and while I don't remember exactly if I had to make any major config changes, the memory on my server had a lot to do with it. I moved to a different host and increased server memory. Also, if I recall correctly, upgrading to ruby 2 greatly impacted memory management and speed. I'm still on Rails 3.2, but now I have very fast, zero downtime deploys. You could test this by decreasing your Unicorn workers one at a time then deploying, to see if your deploys start getting faster. Hope this helps! — DelPiero

travisluong travisluong · Accepted Answer · 2015-04-27T05:19:11

I came across a similar problem recently while trying to set up Rails/Nginx/Unicorn on Digital Ocean. I was able to get zero-downtime deploys to work after tweaking a few things. Here are a few things to try:

Reduce the number of worker process.
Increase the memory of your server. I was getting timeouts on the 512MB RAM droplet. Seemed to fix the issue when I increased it to 1GB.
Use the "capistrano3-unicorn" gem.
If preload_app true, use restart (USR2). If false, use reload (HUP).
Ensure "tmp/pids" is in the set as a linked_dirs in deploy.rb.
Use px aux | grep unicorn to make sure the old processes are being removed.
Use kill [pid] to manually stop any unicorn processes still running.

Here's my unicorn config for reference:

working_directory '/var/www/yourapp/current'
pid '/var/www/yourapp/current/tmp/pids/unicorn.pid'
stderr_path '/var/www/yourapp/log/unicorn.log'
stdout_path '/var/www/yourapp/log/unicorn.log'
listen '/tmp/unicorn.yourapp.sock'
worker_processes 2
timeout 30
preload_app true

before_fork do |server, worker|
  old_pid = "/var/www/yourapp/current/tmp/pids/unicorn.pid.oldbin"
  if old_pid != server.pid
    begin
    sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
    Process.kill(sig, File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
    end
  end
end

deploy.rb

lock '3.4.0'

set :application, 'yourapp'
set :repo_url, '[email protected]:username/yourapp.git'
set :deploy_to, '/var/www/yourapp'
set :linked_files, fetch(:linked_files, []).push('config/database.yml', 'config/secrets.yml', 'config/application.yml')
set :linked_dirs, fetch(:linked_dirs, []).push('log', 'tmp/pids', 'tmp/cache', 'tmp/sockets', 'vendor/bundle', 'public/system')
set :format, :pretty
set :log_level, :info
set :rbenv_ruby, '2.1.3'

namespace :deploy do
  after :restart, :clear_cache do
    on roles(:web), in: :groups, limit: 3, wait: 10 do
    end
  end
end

after 'deploy:publishing', 'deploy:restart'
namespace :deploy do
  task :restart do
    #invoke 'unicorn:reload'
    invoke 'unicorn:restart'
  end
end

Unicorn workers timeout after “zero downtime” deploy with capistrano

2 Answers