Scaling Chef with more API Workers

We’re big fans of Opscode’s Chef software at Etsy, and are using it on close to 700 nodes. Recently, though, we began to see a large number of connection timeouts during Chef runs. A little digging revealed that although the hardware on which we run Chef was by no means struggling, the API worker (the process running on port 4000 that you point knife at by default) was continually maxing out a CPU core.

The default configuration that Chef ships with runs a single API worker, which is more than sufficient for most environments, but evidently we’d hit the limit of what that worker could handle. Fortunately, scaling Chef to spawn more workers and make better use of a modern multi-core machine is easy, though a little poorly documented. So, as with most of the posts I write here, I thought I’d document the process for anyone else hitting the same issues.

Please note, the following instructions are for Red Hat / CentOS-based systems, although most of the steps are platform-agnostic.

The first step to multiple worker nirvana is to configure chef-server to start multiple worker processes. To do this, you’ll want to edit /etc/sysconfig/chef-server and change the OPTIONS line to the following, changing the number of processes as desired – in this example, we’re starting 8:

#Configuration file for the chef-server service
#CONFIG=/etc/chef/server.rb
#PIDFILE=/var/run/chef/server.pid
#LOCKFILE=/var/lock/subsys/chef-server
#LOGFILE=/var/log/chef/server.log
#PORT=4000
#ENVIRONMENT=production
#ADAPTER=thin
#CHILDPIDFILES=/var/run/chef/server.%s.pid
#SERVER_USER=chef
#SERVER_GROUP=chef
#Any additional chef-server options.
OPTIONS="-c 8"

Once you’ve done this, run /etc/init.d/chef-server restart, and then run “ps -ef | grep merb”. You should now see output similar to the following:

chef 16495 1 10 Feb23 ? 2-02:55:03 merb : chef-server (api) : worker (port 4000)
chef 16498 1 8 Feb23 ? 1-15:48:30 merb : chef-server (api) : worker (port 4001)
chef 16503 1 8 Feb23 ? 1-17:33:12 merb : chef-server (api) : worker (port 4002)
chef 16506 1 8 Feb23 ? 1-17:34:43 merb : chef-server (api) : worker (port 4003)
chef 16509 1 9 Feb23 ? 1-17:59:06 merb : chef-server (api) : worker (port 4004)
chef 16515 1 8 Feb23 ? 1-17:45:54 merb : chef-server (api) : worker (port 4005)
chef 16518 1 8 Feb23 ? 1-16:06:50 merb : chef-server (api) : worker (port 4006)
chef 16523 1 8 Feb23 ? 1-17:39:14 merb : chef-server (api) : worker (port 4007)

As you can see from the above output, the new worker processes have been started on ports 4000 through 4007. If we want our chef-clients to hit our new workers, we’re going to need a load balancer sitting in front of the workers. Luckily, since our worker processes communicate over HTTP, we can use Apache for this through the use of its mod_proxy_balancer module. I’m going to assume that you’re familiar with the basics of setting up Apache here, and just cover the specifics of load balancing our workers.
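If you'd rather not count ps lines by eye, a few lines of Ruby can do it for you. This is only a rough sketch: the worker_ports helper and its regular expression are my own invention, based on the line format in the sample output above, and the sample text is trimmed for brevity.

```ruby
# Hypothetical helper (not part of Chef): extract the port of every merb API
# worker from `ps -ef` output. The pattern assumes the process title format
# shown in the sample output above.
WORKER_LINE = /merb : chef-server \(api\) : worker \(port (\d+)\)/

def worker_ports(ps_output)
  ps_output.each_line
           .map { |line| line[WORKER_LINE, 1] } # capture group 1, or nil
           .compact
           .map(&:to_i)
           .sort
end

# Trimmed sample; lines that are not API workers are ignored.
sample = <<~PS
  chef 16495 1 10 Feb23 ? 2-02:55:03 merb : chef-server (api) : worker (port 4000)
  chef 16498 1 8 Feb23 ? 1-15:48:30 merb : chef-server (api) : worker (port 4001)
  root 20111 1 0 Feb23 ? 00:00:00 grep merb
PS

ports = worker_ports(sample)
puts "#{ports.size} workers on ports #{ports.first}..#{ports.last}"
```

On a real server you would pass in the live output instead, for example worker_ports(`ps -ef`), and compare the count with the number you gave to -c.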

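The following vhost example shows how to balance across our new worker processes with mod_proxy_balancer. It assumes that mod_proxy, mod_proxy_http, mod_proxy_balancer and mod_rewrite are already loaded in your Apache configuration.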

<VirtualHost *:80>
   ServerName chef.mydomain.com
   DocumentRoot /usr/share/chef-server/public
   ErrorLog /var/log/httpd/_error_log
   CustomLog /var/log/httpd/access_log combined
   <Directory /usr/share/chef-server/public>
     Options FollowSymLinks
     AllowOverride None
     Order allow,deny
     Allow from all
   </Directory>
   <Proxy balancer://chefworkers>
     BalancerMember http://127.0.0.1:4001
     BalancerMember http://127.0.0.1:4002
     BalancerMember http://127.0.0.1:4003
     BalancerMember http://127.0.0.1:4004
     BalancerMember http://127.0.0.1:4005
     BalancerMember http://127.0.0.1:4006
     BalancerMember http://127.0.0.1:4007
   </Proxy>
   <Location /balancer-manager>
     SetHandler balancer-manager
     Order Deny,Allow
     Deny from all
     Allow from localhost
     Allow from 127.0.0.1
   </Location>
   RewriteEngine On
   RewriteCond %{REQUEST_URI} !=/balancer-manager
   RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
   RewriteRule ^/(.*)$ balancer://chefworkers%{REQUEST_URI} [P,QSA,L]
</VirtualHost>

You might notice that I’ve omitted our original worker on port 4000 from the balancer pool – this is so that we can migrate traffic off our overloaded single worker without throwing any more at it. Once all of our nodes are talking to the load balanced pool, our original worker will be idle and can then safely be added into the pool with its fellows.
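Typing out BalancerMember lines gets tedious as the pool grows, so here is a minimal sketch of generating them. The balancer_block helper and its parameter names are hypothetical, and it assumes the workers listen on consecutive ports starting at the base port, as in the ps output earlier.

```ruby
# Hypothetical helper: build the <Proxy> block for a run of consecutive
# worker ports, optionally leaving some ports out of the pool.
def balancer_block(name:, base_port:, workers:, exclude: [], host: "127.0.0.1")
  ports = (base_port...(base_port + workers)).to_a - exclude
  members = ports.map { |port| "     BalancerMember http://#{host}:#{port}" }
  ["   <Proxy balancer://#{name}>", *members, "   </Proxy>"].join("\n")
end

# Eight workers starting at 4000, with the original worker held back
# until the nodes have migrated.
puts balancer_block(name: "chefworkers", base_port: 4000, workers: 8, exclude: [4000])
```

Dropping the exclude argument later gives you the block with port 4000 back in the pool.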

Once you’ve configured a suitable vhost with your worker pool, restart Apache and make sure that the host name you configured works properly. It’s also worth having a look at the balancer-manager we configured above (http://yourhost/balancer-manager), as this will show you the status of your worker pool and let you tweak weightings and so on if you so desire. Note that the vhost above only allows access to it from localhost, so you’ll need to browse to it from the server itself.

Now that our load balanced worker pool is up and running, all that remains is to point chef-client on our nodes at the new host name. I’m going to assume here that you’re cheffing out your client.rb file (you are cheffing out your client.rb, aren’t you?). Anyway, this step is as simple as changing the chef-server URLs from port 4000 to port 80, or whatever port you set up your Apache vhost on. A sample snippet from client.rb is below:

# Main config
log_level :info
log_location "/var/log/chef/client.log"
ssl_verify_mode :verify_none
registration_url "http://chef.mydomain.com:80"
template_url "http://chef.mydomain.com:80"
remotefile_url "http://chef.mydomain.com:80"
search_url "http://chef.mydomain.com:80"
role_url "http://chef.mydomain.com:80"
client_url "http://chef.mydomain.com:80"
chef_server_url "http://chef.mydomain.com:80"
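Since all seven settings carry the same URL, it can help to keep that URL in one place when you generate client.rb. The following is a rough sketch in plain Ruby rather than an actual Chef template; the URL_SETTINGS constant and url_lines helper are names I've made up for illustration.

```ruby
# Hypothetical helper: emit the URL settings from the snippet above so the
# server URL only has to be changed in one place.
URL_SETTINGS = %w[
  registration_url template_url remotefile_url search_url
  role_url client_url chef_server_url
].freeze

def url_lines(server_url)
  URL_SETTINGS.map { |setting| %(#{setting} "#{server_url}") }
end

# Prints one setting per line, ready to paste into client.rb.
puts url_lines("http://chef.mydomain.com:80")
```

In a real cookbook the same idea would live in the template for client.rb, with the server URL passed in as a variable or attribute.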

With that all done, presto chango: your chef-clients are now pointing at a shiny new pool of load balanced workers making use of as many CPU cores as you can throw at them. Once chef-client has run on all of your nodes, you’ll probably want to add our original worker on port 4000 into the load balancer pool as well.

It’s worth noting that we found the optimum number of worker processes for our setup to be 10. We’re running close to 700 nodes with an interval of 450 seconds and a splay of 150 seconds, but your mileage may vary. Provided your chef-server’s underlying hardware can handle it, keep adding workers until you stop seeing connection timeout errors. I’d recommend you don’t add more workers than you have CPU cores, and remember that you need to leave enough free cores for the rest of Chef’s processes.
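To get a feel for what those numbers mean in terms of load, here is a back-of-the-envelope sketch. It assumes each node waits the interval plus a random delay between zero and the splay (so half the splay on average) and it ignores the time the run itself takes. The run_rate helper is hypothetical, and the figures are estimates rather than measurements.

```ruby
# Hypothetical helper: estimate how many chef-client runs start per second.
# Assumes the average wait between runs is interval + splay / 2 and ignores
# the duration of the run itself.
def run_rate(nodes:, interval:, splay:)
  average_cycle = interval + splay / 2.0
  nodes / average_cycle
end

rate = run_rate(nodes: 700, interval: 450, splay: 150)
workers = 10

puts format("runs started per second: %.2f", rate)            # 1.33
puts format("runs per second per worker: %.2f", rate / workers) # 0.13
```

Bear in mind that every run makes many API requests, so the request rate the workers actually see is a good deal higher than the run rate.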
