From Chef/LXC to Ansible/Docker

Sun Jul 2, 2017 by mike in geekery infrastructure-automation, chef, ansible, docker, lxc

Introduction

I recently changed the way I manage the handful of personal servers that I maintain. I was previously using Chef to provision containers on LXC. I’ve recently switched over to a combination of Ansible and Docker. So far, I’m happy with the switch. Going through the process, I feel like I’ve learned something about what each technology does well. The TLDR is:

Chef remains my favorite system for high-volume high-complexity configuration management work. Dependency management, test tooling, and the comfort and power of the language are all exceptional. But Chef itself is not low-maintenance, and the overhead of keeping a development environment functional dwarfs the “real” work when you have just a few hours of infrastructure automation to do each month.
The Ansible ecosystem is slightly less capable than Chef’s in almost every way, or at least I like using it less. It’s still really really good, though. It’s also simple to set up and never breaks. If you only do a little infrastructure automation, Ansible’s simplicity is ideal. If you do grow to have a very complex environment, Ansible will scale with you. I might be slightly happier managing many tens or a few hundred cookbooks in Chef, but I could certainly get the same job done in Ansible.
Dockerfiles are a massive step backward from both Ansible and Chef in every way. Most of the work done in Dockerfiles is so trivial that sloppy bash and sed for text-replacement is good enough, but it’s not good. I’ve found images on Docker Hub to do everything I want to so far, but when I need to write a nontrivial Dockerfile I’ll probably investigate ansible-container, or just use Ansible in my Dockerfile by installing it, running it in local-mode, and removing it in a single layer.
Though I don’t like the Docker tools for building complex images, I do like that it encourages (requires?) you to be much more rigorous about managing persisent state in volumes. For me Docker’s primary draw is that it helps me succeed at separating my persistent state from my software and config.

Read on for the details.

Your Mileage May Vary

I’m not advocating my own workflow or toolset for anyone else. I’m working in a toy environment, but my experiences might help inform your choices even if your needs are fairly different than mine.

My Environment

I’m doing all this to manage a handful of systems:

A single physical machine in my house that runs half a dozen services.
The Linode instance running this webserver.
Whatever physical, virtual, or cloud lab boxes I might be toying with at the minute.

It’s fairly ridiculous overkill for a personal network, but it gives me a chance to try out ideas in a real, if small, environment.

From LXC

When I was using LXC, I used it only on the physical box running multiple services, not the Linode or lab boxes. Because the physical box ran a bunch of different services, I wanted to isolate them and stop worrying about an OS upgrade or library version upgrade from one service breaking a different service. I chose LXC rather than Xen or VirtualBox because I was memory constrained and LXC containers shared memory more efficiently than “real” virtualization. I didn’t have to allocate memory for each service statically up-front, each container used only what it needed when it needed it. But each container was a “fat” operating system running multiple processes, with a full-blown init-system, SSH, syslog, and all the ancillary services you’d expect to be running on physical hardware or in a VM.

LXC did it’s job smoothly and caused me no problems, but I found I wasn’t any less nervous to do upgrades than before I had split my services into containers. Although my deployment and configuration process was automated, data backup and restore was as much of a hassle as it had always been. And in many cases, I didn’t even really know where services were storing their data, so I had no idea if I was going to have a problem until I tried the upgrade.

LXC does have a mechanism to mount external volumes, but it was manual in my workflow. And my experience with LXC plugins for Chef Provisioning and Vagrant was that they weren’t terribly mature. I didn’t want to try to attempt automating volume configuration in LXC, which set me thinking about alternatives.

To Docker

Docker has great volume support and tons of tooling to automate it, so I figured I’d try migrating.

I was able to find existing images on Docker Hub for all the services I wanted to run. The Dockerfiles used to build these images didn’t leave a great impression compared to the community Chef cookbooks they were replacing. They were much less flexible, exposing many fewer options. The build processes hardcoded tons of assumptions that seem like they’ll make maintenance of the Dockerfile flaky and brittle in the face of upstream changes. But they do work and they seem to be actively maintained. When an image failed to set the environment up as I desired, I was generally able to hack an entrypoint shellscript to fix things up as I desired on container startup. Where configuration options weren’t exposed, I was generally able to override the config-file(s) entirely by mounting them as volumes. It all feels pretty hacky, but each individual hack is simple enough to document in a 2 or 3 line comment, and the number of them is manageable.

By trading off the elegance of my Chef cookbooks for the tire fire of shell scripts defining each container, I’ve gained confidence that my data as well as my configs will be available in each container after upgrade. I’ve already killed and recreated my containers dozens of times in the process of setting them up, and expect to be able to do upgrades and downgrades for each container independently with the minimum necessary hassle.

From Chef

When I was using Chef, I used it manage all my systems. I used it to set up LXC on my container host, to manage the services running inside of each LXC container, to set up the web-service on my Linode, as well as to manage whatever ephemeral lab boxes I was messing with at the moment.

To launch a new service in an LXC container, I would manually launch a new LXC container running a minimal Ubuntu base-image. At the time, the tools I tried using to automated LXC generally had missing features, were unreliable, or both… so I stuck to the bundled command-line interface. Each container would have its own ip-address and DNS name, which I would add to my Chef Provisioning cookbooks as a managed host to deploy my software and configs to the container over SSH. Chef Provisioning would run a wrapper-cookbook specific each node that:

Called out to a base-cookbook to set up users, ssh, and other things that were consistent across all my systems.
Generally called out to some community cookbook to install and configure the service.
Overrode variables to control the behavior of the above cookbooks, and did whatever small tweaks weren’t already covered.

I used Berkshelf manage cookbook dependencies, which is a fantastic system modeled closely on Ruby’s bundler gem tool, and both tools have a powerful and flexible approach to dependency management.

The custom-cookbooks that I wrote had extensive testing to let me iterate on them quickly:

rubocop ran in milliseconds and ensured that any Ruby glue code in my system was syntactically valid and reasonably well styled.
Foodcritic similarly ran in milliseconds and ensured that my cookbooks were syntactically valid Chef code and reasonably well styled.
Chefspec unit tests ran in seconds and helped me iterate quickly to catch a large fraction of logic bugs.
test-kitchen and serverspec ran my cookbooks on real machines to provide slow feedback about the end-to-end behavior of my cookbooks in a real environment.
guard to automatically ran the appropriate tests whenever I saved changes to a source file.

When everything was working, I was able to iterate my cookbooks quickly, catch most errors without having to wait for slow runs against real hardware, enjoy writing Chef/ruby code, and have a lot of confidence in my deploys when I pushed changes to my “real” systems. The problem was, everything almost never worked. Every time I upgraded anything in my Chef development environment, something broke that took hours to fix, usually multiple somethings:

Upgrading gems inevitably resulted in an arcane error that required reading the source code of at least 2 gems to debug. I ended up maintaining personal patches for 6 different gems at one point or another.
ChefSpec tests regularly failed when I upgraded Chef, gems, or community cookbooks. Again, the errors were often difficult to interpret and required reading upstream source-code to debug (though the fixes were always mechanically simple once I understood them). I really like the idea of ChefSpec providing fast feedback on logic errors, but on balance, I definitely spent more time debugging synthetic problems that had no real-world implication than I spent catching “real” problems with ChefSpec.
Using LXC targets with test-kitchen was amazingly fast and memory efficient, but also amazingly brittle. The LXC plugin for test-kitchen didn’t reliably work, so I ended up using test-kitchen to drive vagrant to drive LXC. This setup was unreasonably complicated and frequently broke on upgrades. This pain was largely self-inflicted, test-kitchen can be simple and reliable when run with more popular backends, but it was frustrating nonetheless.
It’s idiomatic in Chef to store each cookbook in it’s own independent git repo (this makes sharing simpler at large scale). Gem versions, cookbook versions, and test configs are stored in various configuration files in the cookbook repository. This meant each upgrade cycle had to be performed separately for each cookbook, testing at each step to see what broke during the upgrade. Even when it went well, the boilerplate for this process was cumbersome, and it rarely went well.
Chef Provisioning was another self-imposed pain-point. Chef provisioning over SSH has been reliable for me, but it’s overkill for my basic use-case. When I started with it, it was very new and I thought I’d be learning an up-and-coming system that would later be useful at work. In fact, it never got a huge user-base and I switched to a job that doesn’t involve Chef at all. It ended up being a bunch of complexity and boilerplate code that could have easily been accomplished with Chef’s built-in knife tool.

ChefDK can help with a lot of these problems, but I always found that I wanted to use something that wasn’t in it, so I either had to maintain two ruby environments or hack up the SDK install, so I tended to avoid it and manage my own environments, which probably caused me more pain than necessary in the long run. When I found something that didn’t work in the ChefDK, I probably should have just decided not to do things that way.

But regardless of whether you use the ChefDK or not, the cost of these problems amortizes well over a large team working on infrastructure automation problems all day long. One person periodically does the work of upgrading, they fix problems, lock versions in source control, and push changes to all your cookbooks. The whole team silently benefits from their work, and also benefits from the ability to iterate quickly on a well-tested library of Cookbooks. When I was working with Chef/Ruby professionally, the overhead of this setup felt tiny and things I learned were relevant to my work. Now that I’m not using Chef/Ruby at work, every problem is mine to solve and it feels like a massive time sink. The iteration speed never pays off because I’m only hacking Chef a few hours a month. It became hugely painful.

To Ansible

Although I migrated many of my services to Docker, I haven’t gone all-in. Minimally, I still need to configure Docker itself on my physical Docker host. And more generally, I’m not yet convinced that I’ll want to use Docker for everything I do. For these problems, I’ve decided to use Ansible.

In most ways, this is a worse-is-better transition.

Ansible Galaxy seems less mature than Berkshelf for managing role dependencies, and the best practice workflow certainly seem less well-documented (do you check in downloaded roles, or download them on-demand using requirements.yml, what’s the process for updating roles in each case?).
Standard practices around testing Ansible roles seem way less mature compared to what I’m used to in the Chef community, and seem mostly limited to running a handful of asserts in the role and running the role in a VM.
The yaml language feels less pleasant to read and write than Ruby to me, though practically they both work for my needs.
I won’t speak to Ansible’s extensibility as I haven’t attempted anything other than writing roles that use built-in resources.

Even though I feel like I’m accepting a downgrade in all the dimensions listed above, Ansible is good enough at each of those things that I don’t really miss the the things I like better about Chef. And the amount of time I spending fixing or troubleshooting Ansible tooling can be effectively rounded to zero. This simplicity and reliability more than makes up for the other tradeoffs.

I now have each of my handful of physical/cloud hosts defined in my Ansible inventory file and my Docker containers are defined in a role using docker_image and docker_container. Perhaps someday I’ll migrate to using Docker Compose but for now this is working well.

Testing is simple and basic, I have a Vagrantfile that launches one or two VirtualBox instances and runs whatever role(s) I’m current hacking on them (including the Docker roles if necessary). Testing a change takes a minute or two, but things mostly work the first time and when there is a problem the fix is usually simple and obvious. Even though my feedback cycle is slower with Ansible, I find that iteration is faster because I’m working on the problem I’m trying to solve instead of yak-shaving six degrees of separation away.

Conclusion

I miss writing Chef cookbooks, it’s still my favorite configuration management system. The overhead of maintaining simple things in Chef is just too high, though, and the power of its ecosystem can’t offset that overhead on small projects. My life with Ansible and Docker feels messier, but it’s also simpler.

I’ve also come to appreciate that while having a really sophisticated configuration management and deployment system is great, it does you precious little good if your management of persistent state across upgrades and node-replacement isn’t similarly sophisticated. Building images with Dockerfiles feels like a huge step backward in terms of configuration management sophistication, but it’s a huge step forward in terms of state management, and that’s a tradeoff well worth making in many situations.