Bulbous, Not Tapered

Foo-fu and other favorites…

Variation in USB Audio Latency

Introduction

Not all USB sound devices are made the same with respect to latency, and the difference matters quite a lot if you hope to use your audio interface for real-time effects.

This post will show two cards with widely varying latency on Linux and Jack, but many of the concepts apply to other platforms as well.

Latency is not Throughput

It’s easy to find specs associated with audio interfaces that relate to “speed”. Knowing that your audio interface supports USB3, which can transmit data at 5000 Mbps, won’t help you evaluate latency, though, and throughput isn’t generally interesting when evaluating an audio interface.

In theory, a high-throughput interface allows you to record more simultaneous tracks at higher bit-depths and higher sample-rates. Throughput limits were important in the early days of USB1.1 interfaces, when the 12 Mbps of shared throughput could only support half-a-dozen hardware channels before saturating the USB1.1 bus. Those concerns are no longer an issue with modern audio interfaces, though. USB2, USB3, Firewire, Thunderbolt, and PCI all have more than enough throughput to handle many tens of channels of simultaneous audio at the highest bit-depths and sample-rates available. Audio manufacturers also design their interfaces so that you don’t have to think about throughput by matching channel counts to the available bandwidth. Throughput specs are easy to advertise, but they just don’t matter for an audio interface.
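To make the throughput argument concrete, here is a rough back-of-the-envelope sketch. It counts raw uncompressed PCM payload only and ignores USB protocol overhead and the fact that input and output share the bus, so the real usable channel counts are lower than these ceilings. The 480 Mbps figure for USB2 is its nominal signalling rate and is not taken from the text above; the function names are made up for illustration.

```python
def channel_mbps(sample_rate_hz: int, bit_depth: int) -> float:
    """Bandwidth of one uncompressed audio channel in megabits per second."""
    return sample_rate_hz * bit_depth / 1_000_000


def max_channels(bus_mbps: float, sample_rate_hz: int, bit_depth: int) -> int:
    """Theoretical ceiling on channels that fit in the bus's signalling rate.

    Raw payload only: protocol overhead is ignored, so treat the result
    as an upper bound rather than a usable channel count.
    """
    return int(bus_mbps // channel_mbps(sample_rate_hz, bit_depth))


# Nominal signalling rates in Mbps. 12 and 5000 are the figures quoted in
# the post; 480 for USB 2.0 is an added assumption.
BUSES = {"USB 1.1": 12, "USB 2.0": 480, "USB 3.0": 5000}

for name, mbps in BUSES.items():
    cd_quality = max_channels(mbps, 44_100, 16)
    high_res = max_channels(mbps, 192_000, 24)
    print(f"{name}: up to {cd_quality} channels at 16-bit/44.1kHz, "
          f"{high_res} at 24-bit/192kHz")
```

Even as an optimistic ceiling, the numbers show why USB1.1 was tight and why anything newer has room to spare.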

Round-trip latency is what you need to be concerned about if you intend to use your audio interface for real-time effects or synthesis. These specs are often much harder to come by, although some manufacturers like Focusrite do advertise their best-case measured latency. There is a great deal of variation in performance with regard to latency, and it varies not just from card to card, but from computer to computer depending on what operating system you run and how it’s configured. For some applications (like podcasting) the difference almost certainly doesn’t matter. For other applications (like real-time effects) the difference is almost certainly going to be both audible and distracting. I’m no expert on psychoacoustics, but my impression (based on reading and listening) is that ~15ms is a good rule of thumb for when audio latency becomes distracting. Even among interfaces targeted at musicians, it’s common for audio interfaces to vary between 5ms and 30ms, which means some of them induce distracting amounts of latency.

Impact of Latency

  • Podcasting: Non-musical voice recording can tolerate a relatively large amount of latency. I haven’t encountered any hardware that isn’t suitable for this purpose, and I wouldn’t worry about latency if this is your use-case.
  • Real-time Effects: Playing an instrument live, routing the audio signal into your interface, processing it with plugins, and then outputting it to headphones/speakers is the most demanding use-case in terms of latency… and also the most subjective. Some musicians may not notice 20ms of latency, others might. Even the same musician might not notice when playing a pad synth with a soft attack, and then be distracted by the same latency during a piano part. If your hardware isn’t capable of achieving sub-20ms latency, it’s likely you’ll hit some situation where someone notices and is distracted. Not all USB interfaces are capable of hitting that bar (probably most fail at it).
  • Soft-synths: Synthesis has an advantage over real-time effects in that it is affected only by output latency, and is not affected by audio input latency. While soft-synths are subject to MIDI input latency when played from an external MIDI controller, that latency is usually much less than audio input latency. Interfaces with 25ms-30ms of round-trip latency as measured by jack_iodelay may still be able to achieve consistently acceptable latency for soft-synths.
  • Multi-Track Music Recording: When overdubbing multiple audio tracks, you often end up wanting to do soft-synths or real-time effects, in which case the above advice applies. For more basic multi-track recording, you can:
    • Monitor input directly (many interfaces have a direct/hardware monitoring switch or dial, or you can use an external mixer).
    • Have Jack compensate output timings based on the end-to-end round-trip latency measured by jack_iodelay. See the Ardour Manual and this Ardour Community Thread for details.

USB Latency Chain

ProAudioBlog, AndroidAuthority and the LinuxAudio wiki all have articles that provide solid introductions to the different places that can introduce latency into the audio path. I’ll provide my own take below:

Name | Latency | Description
Sound Generation | 1ms - 10ms | Instruments don’t react immediately to produce sound. A synthesizer is likely to have 5ms of latency. Even acoustic instruments have latency. The low-E on a bass guitar vibrates at 41.2Hz, or one complete vibration roughly every 24 milliseconds. It must take several milliseconds to excite a recognizable pitch.
Analog Cables and Components | 0ms | Signals propagate through copper at roughly 2/3rds of the speed of light. It would require over 100 miles of analog cable to introduce a single ms of latency. This also applies to any purely analog outboard gear you may have; routing signals through a mixer or a compressor won’t affect latency in a detectable way.
Digital Outboard Components | 2ms-?ms | While purely analog outboard components behave like a bunch of analog cable from a latency perspective, digital outboard components behave like a tiny computer with their own buffers that introduce latency. Expect at least several ms of latency from any digital outboard gear.
MIDI Input Latency | 1ms | If you’re driving a soft-synth from a MIDI controller, you’ll experience MIDI input latency instead of ADC, OS, and Jack input processing. MIDI input latency is generally low compared to audio input latency, often about a millisecond.
ADC | 1ms-10ms | The audio interface must take analog sound from the inputs and convert it to digital data, as well as apply any digital effects or processing.
OS Input Processing | 1ms-5ms | The USB subsystem, the audio driver, and possibly other OS components must process data coming in from the audio interface.
Jack Input Processing | 1ms-5ms on a well-tuned kernel | This is the number that Jack and front-ends like Cadence and QJackCtl give you. It’s NOT the end-to-end latency of your system.
Application Processing | 0ms-?ms | Any soft-synths or effects plugins can introduce their own processing delays between when Jack delivers audio samples or MIDI signals to them and when they output their processed audio samples on their output Jack port.
Jack Output Processing | 1ms-5ms on a well-tuned kernel | This is the same number described in “Jack Input Processing”. In addition to the delay incurred on input, Jack must process the audio on output and incur a second delay.
OS Output Processing | 1ms-5ms | The USB subsystem, audio drivers, and possibly other OS components must process data going out to the audio interface.
DAC | 1ms-10ms | The audio interface must take the output data and convert it to an analog signal on the outputs, as well as perform any effects or signal processing.
Distance from Speaker | 0ms-10ms | Sound travels through the air at about 1 foot per millisecond. If you’re wearing headphones, this delay rounds to zero. If you’re 10 feet away from your speakers, sound traveling through the air may be eating up a good chunk of your latency budget.
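The purely physical entries in the table are easy to sanity-check with a little arithmetic. The sketch below uses the post’s own rules of thumb (signals in copper at about 2/3 the speed of light, sound at about 1 foot per millisecond); the function names and constants are illustrative, not part of any audio tool.

```python
SPEED_OF_LIGHT_KM_PER_MS = 299.792458  # speed of light in vacuum
CABLE_VELOCITY_FACTOR = 2 / 3          # the post's rough figure for copper
KM_PER_MILE = 1.609344
SOUND_FT_PER_MS = 1.0                  # the post's rule of thumb for air


def cable_delay_ms(length_miles: float) -> float:
    """Propagation delay through an analog cable of the given length."""
    km = length_miles * KM_PER_MILE
    return km / (SPEED_OF_LIGHT_KM_PER_MS * CABLE_VELOCITY_FACTOR)


def cable_miles_for_delay(delay_ms: float) -> float:
    """Cable length needed to accumulate the given delay."""
    km = delay_ms * SPEED_OF_LIGHT_KM_PER_MS * CABLE_VELOCITY_FACTOR
    return km / KM_PER_MILE


def air_delay_ms(distance_ft: float) -> float:
    """Delay for sound to cross the given distance through air."""
    return distance_ft / SOUND_FT_PER_MS


def period_ms(frequency_hz: float) -> float:
    """Duration of one complete vibration at the given pitch."""
    return 1000.0 / frequency_hz


print(f"Cable needed for 1ms of delay: {cable_miles_for_delay(1):.0f} miles")
print(f"Delay through a 20ft cable: {cable_delay_ms(20 / 5280) * 1e6:.1f} ns")
print(f"Delay at 10ft from a speaker: {air_delay_ms(10):.0f} ms")
print(f"One cycle of a bass low-E (41.2Hz): {period_ms(41.2):.1f} ms")
```

The takeaway matches the table: cables are free, but both the distance to your speakers and the physics of low notes cost real milliseconds.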

The latency chain has some notable properties:

  1. There are lots of possible sources of latency, almost any one of which can completely blow a 15ms latency budget.
  2. It’s often difficult to know where in the signal path latency is coming from or know when latency from multiple components is stacking up to cause a problem.
  3. Confusion about latency sources is made even worse by the fact that few components advertise or document their latency properties. You almost always have to measure latency yourself to have any idea what’s going on.
  4. Lots of audio software (including Jack and front-ends like Cadence or QJackCtl) highlight their own latency prominently but tell you nothing about end-to-end latency. If you’ve ever read a forum post where someone said “My system has 2ms of latency and its very distracting!” you can be certain that person is misunderstanding latency reported by their software, has end-to-end latency of 20ms or more, but has no idea how much or what is contributing to it.

These challenges combine to ensure that there is an enormous amount of bad anecdotal advice on the internet about latency.
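One way to see how quickly the stages stack up is to add the table’s ranges together. This is only a sketch: the numbers are the rough ranges from the table for the audio-input path (so MIDI input and digital outboard gear are left out), and application processing has no upper bound in the table, so it is tracked separately rather than guessed.

```python
from typing import List, NamedTuple, Optional


class Stage(NamedTuple):
    name: str
    low_ms: float
    high_ms: Optional[float]  # None means the table gives no upper bound


# Ranges copied from the latency chain table (audio-input path only).
AUDIO_CHAIN = [
    Stage("Sound generation", 1, 10),
    Stage("Analog cables and components", 0, 0),
    Stage("ADC", 1, 10),
    Stage("OS input processing", 1, 5),
    Stage("Jack input processing", 1, 5),
    Stage("Application processing", 0, None),
    Stage("Jack output processing", 1, 5),
    Stage("OS output processing", 1, 5),
    Stage("DAC", 1, 10),
    Stage("Distance from speaker", 0, 10),
]

BUDGET_MS = 15  # the post's rule-of-thumb threshold for distracting latency


def best_case_ms(chain: List[Stage]) -> float:
    """Sum of the low end of every stage."""
    return sum(stage.low_ms for stage in chain)


def worst_case_ms(chain: List[Stage]) -> float:
    """Sum of the high end of every stage that has a stated upper bound."""
    return sum(stage.high_ms for stage in chain if stage.high_ms is not None)


def unbounded_stages(chain: List[Stage]) -> List[str]:
    """Names of stages whose worst case the table leaves open."""
    return [stage.name for stage in chain if stage.high_ms is None]


print(f"Best case:  {best_case_ms(AUDIO_CHAIN)} ms (budget {BUDGET_MS} ms)")
print(f"Worst case: {worst_case_ms(AUDIO_CHAIN)} ms, plus unbounded: "
      f"{', '.join(unbounded_stages(AUDIO_CHAIN))}")
```

Even the best case uses about half of a 15ms budget, and the bounded worst case is several times over it before any plugin adds its own delay.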

Listening For Latency

The simplest way to listen for latency is just to play and listen. Somewhere between 20ms and 75ms of end-to-end latency, you will start to hear and become distracted by the delay. Use staccato notes so you can focus on the timing of the attack.

If you have the ability to mix your “direct” signal and the signal after routing it through your computer into a set of headphones, this will allow you to detect latency that may not be distracting (to you) but is perceptible (and therefore might be distracting to someone you play with if they’re more sensitive). Some audio interfaces have a built-in dial to mix “direct” (0-latency, 100% analog signal path from the interface inputs to the headphone outputs) and digital outputs from your computer. Or if you have an analog mixer with an effects bus or similar mechanism to route a signal out and back in, you can plug your mic or instrument into the analog mixer, send it to the computer via the effects bus, and monitor both signals in the headphones.

The most sensitive mechanism I’ve found to detect latency right at the threshold of what I’m able to perceive is to sing into a mic. On a zero-latency analog signal path, or one with less than about 10ms of end-to-end latency, this sounds “normal” to me. With 25ms of end-to-end latency, it sounds “weird”… similar to a vocal doubler or phaser effect. Unless you have a pretty good sense of how your equipment performs, this technique probably won’t help you. Once you recognize the sound, though, it’s a quick and sensitive test to perform.

Measuring Latency

Focusrite Scarlett 2i2

First let’s verify that Alsa detects the presence of the card by connecting it via usb and running aplay -l to list all available Alsa devices. Card 0 is my laptop’s built-in sound-card, and card 1 shows the Scarlett 2i2 plugged in via USB:

$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: PCH [HDA Intel PCH], device 0: ALC293 Analog [ALC293 Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: PCH [HDA Intel PCH], device 3: HDMI 0 [HDMI 0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: PCH [HDA Intel PCH], device 7: HDMI 1 [HDMI 1]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: PCH [HDA Intel PCH], device 8: HDMI 2 [HDMI 2]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: USB [Scarlett 2i2 USB], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

In order to measure audio latency, we must:

Cadence Settings

  1. Configure jack. You can do this via the jackd command-line tool, qjackctl, or cadence. All the options are reasonable, but I tend to use cadence. After starting it and clicking configure, my settings look like the screenshot above.
    1. hw:USB is the name by which Jack knows the Scarlett
    2. The Sample Rate, Buffer Size (which is confusingly named in cadence since it sets the period size not the buffer size and is called frames/period in other programs), and Periods/Buffer indicate that Jack itself will take 2ms to process incoming audio and an additional 2ms to process outgoing audio… so 4ms of total latency coming from Jack itself.
    3. Click start in cadence to start up Jack. With these settings I do see periodic xruns every 10 or 20 minutes. This may be approaching the limits of the hardware’s ability to deliver data on-time, or my Linux instance may need further tuning to process the data on-time. The occasional xruns are not distracting in practice sessions, though.
  2. Physically connect the left output on the Scarlett to input 1 on the Scarlett using a quarter-inch cable. This will allow jack_iodelay to measure its own output once we wire things up in Jack.
  3. In a terminal, run jack_iodelay.
  4. Start catia, which will let us wire up the Jack routes.
    1. Maximize catia and select Canvas -> Zoom -> Auto-Fit if the various ports are scrolled off-screen or are otherwise difficult to read.
    2. Connect capture_1 to jack_iodelay’s input, and connect jack_iodelay’s output to playback_1. In conjunction with our physical cable from step 2, we now have a closed loop that lets jack_iodelay analyze the signal it’s generating to see how long it takes to complete the loop.
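The “2ms in, 2ms out” figure from the Jack configuration step can be reproduced from the three settings. The sketch below assumes a sample rate of 48000Hz and 3 periods/buffer; those values are not stated in the text and are inferred from the measurements, so substitute your own. The formula is the simple buffer-size calculation that front-ends display, not a measurement.

```python
def jack_one_way_ms(frames_per_period: int, periods_per_buffer: int,
                    sample_rate_hz: int) -> float:
    """Latency of one full Jack buffer (periods x frames) in milliseconds."""
    return 1000 * frames_per_period * periods_per_buffer / sample_rate_hz


def jack_round_trip_ms(frames_per_period: int, periods_per_buffer: int,
                       sample_rate_hz: int) -> float:
    """Input plus output buffering, i.e. twice the one-way figure."""
    return 2 * jack_one_way_ms(frames_per_period, periods_per_buffer,
                               sample_rate_hz)


# Assumed settings: 48kHz and 3 periods/buffer (inferred, not stated).
SAMPLE_RATE_HZ = 48_000
PERIODS_PER_BUFFER = 3

for frames in (32, 64, 128):
    one_way = jack_one_way_ms(frames, PERIODS_PER_BUFFER, SAMPLE_RATE_HZ)
    both = jack_round_trip_ms(frames, PERIODS_PER_BUFFER, SAMPLE_RATE_HZ)
    print(f"{frames:>3} frames/period: {one_way:.0f} ms each way, "
          f"{both:.0f} ms from Jack in total")
```

Each doubling of the period size doubles Jack’s share of the latency, which is why the smallest period that runs without xruns matters so much.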

Back in the terminal window, jack_iodelay will now have some useful output:

$ jack_iodelay
new capture latency: [0, 0]
new playback latency: [0, 0]
Signal below threshold...
Signal below threshold...
< repeated many times while we wire up the connections in catia >
new capture latency: [32, 32]
   417.637 frames      8.701 ms total roundtrip latency
        extra loopback latency: 289 frames
        use 144 for the backend arguments -I and -O
< repeated endlessly until jack_iodelay is killed via Ctrl-c >
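If you want to log these measurements rather than read them off the terminal, the output is easy to parse. This is a minimal sketch that works on a captured copy of the text above; the regular expressions are written against this one sample and may need adjusting for other jack_iodelay versions. The sample rate is not printed by the tool, so it is derived here from the frames and milliseconds on the roundtrip line.

```python
import re

# Captured from the jack_iodelay run shown above.
SAMPLE_OUTPUT = """\
new capture latency: [32, 32]
   417.637 frames      8.701 ms total roundtrip latency
        extra loopback latency: 289 frames
        use 144 for the backend arguments -I and -O
"""

ROUNDTRIP = re.compile(
    r"([\d.]+)\s+frames\s+([\d.]+)\s+ms total roundtrip latency")
EXTRA = re.compile(r"extra loopback latency:\s+(\d+)\s+frames")


def parse_iodelay(text: str) -> dict:
    """Extract the roundtrip and extra-loopback figures from the output."""
    roundtrip = ROUNDTRIP.search(text)
    extra = EXTRA.search(text)
    if roundtrip is None or extra is None:
        raise ValueError("no latency measurement found in the output")
    frames = float(roundtrip.group(1))
    millis = float(roundtrip.group(2))
    rate_hz = frames / millis * 1000  # implied sample rate
    extra_frames = int(extra.group(1))
    return {
        "roundtrip_frames": frames,
        "roundtrip_ms": millis,
        "implied_rate_hz": rate_hz,
        "extra_frames": extra_frames,
        "extra_ms": extra_frames / rate_hz * 1000,
    }


result = parse_iodelay(SAMPLE_OUTPUT)
print(f"Round trip: {result['roundtrip_ms']} ms "
      f"at about {result['implied_rate_hz'] / 1000:.0f} kHz")
print(f"Latency Jack did not already know about: {result['extra_ms']:.2f} ms")
```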

Lexicon Omega

Again, let’s verify that Alsa is detecting the card correctly with aplay:

$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: PCH [HDA Intel PCH], device 0: ALC293 Analog [ALC293 Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: PCH [HDA Intel PCH], device 3: HDMI 0 [HDMI 0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: PCH [HDA Intel PCH], device 7: HDMI 1 [HDMI 1]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: PCH [HDA Intel PCH], device 8: HDMI 2 [HDMI 2]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: Omega [Lexicon Omega], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: Omega [Lexicon Omega], device 1: USB Audio [USB Audio #1]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

Again test latency by:

  1. Configuring jackd via cadence. After selecting the appropriate device, jack this time refuses to start with a Buffer Size (aka frames/period) of 32. I get excessive xruns at 64, so I have to bump all the way up to 128. Jack itself now adds 16ms of latency to the signal. Start jack.
  2. Physically connect the Omega’s left output to its Line-1 input, and use the Omega’s hardware channel-selection button to assign Line-1/Line-2 to input 1 and input 2 seen by jackd.
  3. Start jack_iodelay.
  4. In catia, wire up capture_1 to the input on jack_iodelay, and wire up the output from jack_iodelay to playback_1.

$ jack_iodelay
new capture latency: [0, 0]
new playback latency: [0, 0]
Signal below threshold...
Signal below threshold...
< repeated many times while we wire up the connections in catia >
new playback latency: [384, 384]
  1132.755 frames     23.599 ms total roundtrip latency
        extra loopback latency: 620 frames
        use 310 for the backend arguments -I and -O
< repeated endlessly until jack_iodelay is killed via Ctrl-c >

Analyzing Latency Measurements

  • Previous reading had led me to believe that there was approximately 10ms of unavoidable latency from ADC, DAC, and USB/alsa drivers. Apparently this isn’t always the case. For the Scarlett, the roundtrip latency is 8.7ms, of which jackd accounts for 4ms. This means that the ADC, DAC, and OS latency can’t total more than 4.7ms, half of what I had previously thought possible.
  • The Scarlett is an excellent result, and shows that USB audio interfaces can achieve latency well below the threshold of perception and even approach latency expected from dedicated digital audio devices like hardware synths and digital effects boxes, which I believe tend to run between 2ms and 6ms.
  • Not all USB audio interfaces achieve acceptable latency, and latency measurements on the internet are often wrong. Be wary of latency measurements that don’t specify how they were made, and try to test hardware on your own computer if your latency requirements are strict.
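The subtraction in the first bullet can be applied to both cards. The sketch below uses the measured roundtrip figures and the Jack buffering figures quoted in the two walkthroughs; the remainder for the Omega is derived the same way as for the Scarlett and is not a number stated in the text. The 15ms threshold is the rule of thumb from the introduction.

```python
# Roundtrip figures from jack_iodelay and Jack's own share as quoted above.
MEASUREMENTS = {
    "Focusrite Scarlett 2i2": {"roundtrip_ms": 8.701, "jack_ms": 4.0},
    "Lexicon Omega": {"roundtrip_ms": 23.599, "jack_ms": 16.0},
}

THRESHOLD_MS = 15  # rule-of-thumb point where latency becomes distracting


def non_jack_ms(measurement: dict) -> float:
    """Latency left for ADC, DAC, and OS after removing Jack's buffering."""
    return measurement["roundtrip_ms"] - measurement["jack_ms"]


def within_threshold(measurement: dict) -> bool:
    """True when the measured roundtrip stays under the rule of thumb."""
    return measurement["roundtrip_ms"] < THRESHOLD_MS


for card, measurement in MEASUREMENTS.items():
    verdict = "under" if within_threshold(measurement) else "over"
    print(f"{card}: {measurement['roundtrip_ms']:.1f} ms roundtrip "
          f"({verdict} {THRESHOLD_MS} ms), "
          f"{non_jack_ms(measurement):.1f} ms outside Jack")
```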

Disable Touchpad While Typing

Introduction

I have a Lenovo Thinkpad t460p laptop that currently runs Ubuntu 17.04. In general the system is a pleasure to use but one niggle has been mildly infuriating… the touchpad regularly engages when I’m typing and my cursor jumps to an unwanted position mid-word. It doesn’t happen frequently enough to be a serious problem, but it does happen frequently enough to be intensely irritating. The fix was simple, but researching it was not.

The Fix

Stop using the synaptics driver and start using libinput. For me this was as simple as running aptitude remove xserver-xorg-input-synaptics and rebooting.

Since your system may not be configured exactly as mine was, details follow so you can gain a better sense of what might be going on with your own system.

The Hardware

The Thinkpad t460p includes both a touchpad and a pointing stick.

Thinkpad Pointing Stick

These show up as separate input devices under X11:

$ xinput
⎡ Virtual core pointer                          id=2    [master pointer  (3)]
⎜   ↳ Virtual core XTEST pointer                id=4    [slave  pointer  (2)]
⎜   ↳ SynPS/2 Synaptics TouchPad                id=12   [slave  pointer  (2)]
⎜   ↳ TPPS/2 IBM TrackPoint                     id=13   [slave  pointer  (2)]
[... more output truncated...]

The Drivers

There are multiple drivers potentially in play here. By default the synaptics driver is installed via the xserver-xorg-input-synaptics package, and the libinput driver is also installed by default via the xserver-xorg-input-libinput package:

$ sudo aptitude search xserver-xorg-input | egrep 'synaptics|libinput'
i  xserver-xorg-input-libinput - X.Org X server -- libinput input driver
p  xserver-xorg-input-libinput:i386 - X.Org X server -- libinput input driver
p  xserver-xorg-input-libinput-dev - X.Org X server -- libinput input driver (development headers)
i  xserver-xorg-input-synaptics - Synaptics TouchPad driver for X.Org server
p  xserver-xorg-input-synaptics:i386 - Synaptics TouchPad driver for X.Org server
p  xserver-xorg-input-synaptics-dev - Synaptics TouchPad driver for X.Org server (development headers)

The synaptics driver takes precedence over the libinput driver for the SynPS/2 Synaptics Touchpad device. This can be confirmed by looking at detailed information for the SynPS/2 Synaptics Touchpad device using the id 12 that we got in our previous xinput command. We can see that the Synaptics driver is in use because each of the properties in the list is prefixed by that driver name:

$ xinput list-props 12
Device 'SynPS/2 Synaptics TouchPad':
        Device Enabled:         1
        Synaptics Edges:                1632, 5312, 1575, 4281
        Synaptics Finger:               25, 30, 256
[... more output truncated...]

The libinput driver is working, though, and is in use by the pointing stick, as we can see by noting the libinput prefix on the properties associated with id 13 (the TPPS/2 IBM TrackPoint from our initial xinput command).

$ xinput list-props 13
Device 'TPPS/2 IBM TrackPoint':
        Device Enabled (141):   1
        Coordinate Transformation Matrix (143): 1.000000, 0.000000, 0.000000, 0.000000, 1.000000, 0.000000, 0.000000, 0.000000, 1.000000
        libinput Accel Speed (284):     0.000000
        libinput Accel Speed Default (285):     0.000000
[... more output truncated...]
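The prefix check described above is simple enough to script. This is a rough sketch that guesses the driver from a captured copy of the xinput list-props output; it only knows about the two prefixes seen in this post, and the function name is made up for illustration. To use it on a live system you would feed it the text printed by xinput list-props for your own device id.

```python
from collections import Counter
from typing import Optional

# Truncated samples copied from the xinput list-props output shown above.
SAMPLE_TOUCHPAD = """\
Device 'SynPS/2 Synaptics TouchPad':
        Device Enabled:         1
        Synaptics Edges:                1632, 5312, 1575, 4281
        Synaptics Finger:               25, 30, 256
"""

SAMPLE_TRACKPOINT = """\
Device 'TPPS/2 IBM TrackPoint':
        Device Enabled (141):   1
        libinput Accel Speed (284):     0.000000
        libinput Accel Speed Default (285):     0.000000
"""

# Only the two driver prefixes that appear in this post.
KNOWN_PREFIXES = ("Synaptics", "libinput")


def guess_driver(list_props_output: str) -> Optional[str]:
    """Return the driver prefix that most property names start with."""
    counts = Counter()
    # Skip the first line: it is the "Device '...':" header, not a property.
    for line in list_props_output.splitlines()[1:]:
        prop = line.strip()
        for prefix in KNOWN_PREFIXES:
            if prop.startswith(prefix + " "):
                counts[prefix] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print("Touchpad driver:  ", guess_driver(SAMPLE_TOUCHPAD))
print("TrackPoint driver:", guess_driver(SAMPLE_TRACKPOINT))
```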

Synaptics and PalmDetect

One of the commonly suggested approaches to address touchpad jumpiness while typing is the palm detection feature of the Synaptics driver. This can be configured either via synclient as described in the Arch Linux Wiki for Synaptics, via xinput set-prop, or by editing the xorg config in a file like /etc/X11/xorg.conf.d/50-synaptics.conf.

Enabling PalmDetect had no noticeable impact for me. I didn’t dig enough to determine if the feature was actually broken on my hardware, or if it just addresses a different problem. My palms don’t actually rest on the pad when I type, but the plastic of the laptop case flexes enough that the touchpad interprets it as input. It may be that PalmDetect is correctly detecting that no palm is resting on the trackpad and so allows the bad input through.

Synaptics and syndaemon

A second commonly suggested approach to erroneous touchpad input while typing is syndaemon. Syndaemon monitors xorg for keyboard activity (either by polling frequently or via the more efficient XRecord interface), and when activity is detected it briefly disables the touchpad by doing something roughly equivalent to xinput set-prop 12 "Device Enabled" 0. The Arch Wiki for Synaptics has advice on configuring syndaemon, or it can be added to your Gnome startup applications to run as your normal user on Gnome login.

Syndaemon also had no effect for me. It appears that in systems with multiple pointers, syndaemon only attempts to disable the first device. This issue is reported and confirmed in Ubuntu 1591699. In that bug report, the first pointing device was a “ghost” and could be disabled manually. In my case, there are legitimately two pointing devices present and I use them both, so I don’t want to disable either of them. It appears that if you have two pointing devices and your touchpad doesn’t have the lowest xinput id, there is no way to configure syndaemon to suppress input from the correct device.
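For experimenting by hand, it helps to look the touchpad up by name instead of hard-coding id 12, since ids can differ between machines. The sketch below is not how syndaemon works internally; it is a hypothetical helper that parses a captured copy of the xinput listing and builds the same xinput set-prop command quoted above, without running it.

```python
import re
from typing import List, Optional

# Captured from the xinput listing shown earlier in this post.
SAMPLE_XINPUT = """\
⎡ Virtual core pointer                          id=2    [master pointer  (3)]
⎜   ↳ Virtual core XTEST pointer                id=4    [slave  pointer  (2)]
⎜   ↳ SynPS/2 Synaptics TouchPad                id=12   [slave  pointer  (2)]
⎜   ↳ TPPS/2 IBM TrackPoint                     id=13   [slave  pointer  (2)]
"""


def find_device_id(xinput_output: str, name_fragment: str) -> Optional[int]:
    """Return the id of the first device whose line contains the fragment."""
    for line in xinput_output.splitlines():
        match = re.search(r"id=(\d+)", line)
        if match and name_fragment.lower() in line.lower():
            return int(match.group(1))
    return None


def toggle_command(device_id: int, enabled: bool) -> List[str]:
    """Build (but do not run) the xinput command that flips Device Enabled."""
    return ["xinput", "set-prop", str(device_id), "Device Enabled",
            "1" if enabled else "0"]


touchpad_id = find_device_id(SAMPLE_XINPUT, "TouchPad")
if touchpad_id is not None:
    print("Disable:  ", " ".join(toggle_command(touchpad_id, False)))
    print("Re-enable:", " ".join(toggle_command(touchpad_id, True)))
```

Passing the resulting list to subprocess.run would execute it; printing it first makes it easy to confirm the right device was picked.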

Libinput and DWT

Libinput is a library that handles input devices (both keyboard and pointer devices) for Wayland, but as we found in our drivers section above, libinput is installed by default and works for xorg systems as well. Libinput has a disable-while-typing feature built in and enabled by default.

I was able to activate libinput for my Synaptics touchpad simply by uninstalling the synaptics driver and rebooting. The libinput DWT feature began working immediately and my pointer became inactive while I was typing. Problem solved!

Libinput and Right-Click

The synaptics driver also has features to divide the touchpad area into sections that trigger different buttons, and by default the right-half triggers a right-click when depressed. Libinput uses the whole touchpad for left-clicking, which is better behavior in my opinion. For right-clicks I use the hardware-button just above the trackpad.

It is allegedly possible to configure the synaptics driver to disable the right-click area but I never tried this myself as libinput has all the behaviors I want.

Note that the Thinkpad t460p’s touchpad acts as a hardware button, physically clicking when the pad is depressed. Both libinput howto’s above talk about how to enable the Tapping feature, which is not necessary for that hardware button to function. I suspect you only need the Tapping option if you want light taps to register as clicks, which seems unnecessary and undesirable for this hardware.

From Chef/LXC to Ansible/Docker

Introduction

I recently changed the way I manage the handful of personal servers that I maintain. I was previously using Chef to provision containers on LXC. I’ve recently switched over to a combination of Ansible and Docker. So far, I’m happy with the switch. Going through the process, I feel like I’ve learned something about what each technology does well. The TLDR is:

  • Chef remains my favorite system for high-volume high-complexity configuration management work. Dependency management, test tooling, and the comfort and power of the language are all exceptional. But Chef itself is not low-maintenance, and the overhead of keeping a development environment functional dwarfs the “real” work when you have just a few hours of infrastructure automation to do each month.
  • The Ansible ecosystem is slightly less capable than Chef’s in almost every way, or at least I like using it less. It’s still really really good, though. It’s also simple to set up and never breaks. If you only do a little infrastructure automation, Ansible’s simplicity is ideal. If you do grow to have a very complex environment, Ansible will scale with you. I might be slightly happier managing many tens or a few hundred cookbooks in Chef, but I could certainly get the same job done in Ansible.
  • Dockerfiles are a massive step backward from both Ansible and Chef in every way. Most of the work done in Dockerfiles is so trivial that sloppy bash and sed for text-replacement is good enough, but it’s not good. I’ve found images on Docker Hub to do everything I want to so far, but when I need to write a nontrivial Dockerfile I’ll probably investigate ansible-container, or just use Ansible in my Dockerfile by installing it, running it in local-mode, and removing it in a single layer.
  • Though I don’t like the Docker tools for building complex images, I do like that it encourages (requires?) you to be much more rigorous about managing persistent state in volumes. For me Docker’s primary draw is that it helps me succeed at separating my persistent state from my software and config.

Read on for the details.

Your Mileage May Vary

I’m not advocating my own workflow or toolset for anyone else. I’m working in a toy environment, but my experiences might help inform your choices even if your needs are fairly different than mine.

My Environment

I’m doing all this to manage a handful of systems:

  1. A single physical machine in my house that runs half a dozen services.
  2. The Linode instance running this webserver.
  3. Whatever physical, virtual, or cloud lab boxes I might be toying with at the minute.

It’s fairly ridiculous overkill for a personal network, but it gives me a chance to try out ideas in a real, if small, environment.

From LXC

When I was using LXC, I used it only on the physical box running multiple services, not the Linode or lab boxes. Because the physical box ran a bunch of different services, I wanted to isolate them and stop worrying about an OS upgrade or library version upgrade from one service breaking a different service. I chose LXC rather than Xen or VirtualBox because I was memory constrained and LXC containers shared memory more efficiently than “real” virtualization. I didn’t have to allocate memory for each service statically up-front, each container used only what it needed when it needed it. But each container was a “fat” operating system running multiple processes, with a full-blown init-system, SSH, syslog, and all the ancillary services you’d expect to be running on physical hardware or in a VM.

LXC did its job smoothly and caused me no problems, but I found I wasn’t any less nervous to do upgrades than before I had split my services into containers. Although my deployment and configuration process was automated, data backup and restore was as much of a hassle as it had always been. And in many cases, I didn’t even really know where services were storing their data, so I had no idea if I was going to have a problem until I tried the upgrade.

LXC does have a mechanism to mount external volumes, but it was manual in my workflow. And my experience with LXC plugins for Chef Provisioning and Vagrant was that they weren’t terribly mature. I didn’t want to try to attempt automating volume configuration in LXC, which set me thinking about alternatives.

To Docker

Docker has great volume support and tons of tooling to automate it, so I figured I’d try migrating.

I was able to find existing images on Docker Hub for all the services I wanted to run. The Dockerfiles used to build these images didn’t leave a great impression compared to the community Chef cookbooks they were replacing. They were much less flexible, exposing many fewer options. The build processes hardcoded tons of assumptions that seem like they’ll make maintenance of the Dockerfile flaky and brittle in the face of upstream changes. But they do work and they seem to be actively maintained. When an image failed to set the environment up as I desired, I was generally able to hack an entrypoint shellscript to fix things up as I desired on container startup. Where configuration options weren’t exposed, I was generally able to override the config-file(s) entirely by mounting them as volumes. It all feels pretty hacky, but each individual hack is simple enough to document in a 2 or 3 line comment, and the number of them is manageable.

By trading off the elegance of my Chef cookbooks for the tire fire of shell scripts defining each container, I’ve gained confidence that my data as well as my configs will be available in each container after upgrade. I’ve already killed and recreated my containers dozens of times in the process of setting them up, and expect to be able to do upgrades and downgrades for each container independently with the minimum necessary hassle.

From Chef

When I was using Chef, I used it to manage all my systems. I used it to set up LXC on my container host, to manage the services running inside of each LXC container, to set up the web-service on my Linode, as well as to manage whatever ephemeral lab boxes I was messing with at the moment.

To launch a new service in an LXC container, I would manually launch a new LXC container running a minimal Ubuntu base-image. At the time, the tools I tried using to automate LXC generally had missing features, were unreliable, or both… so I stuck to the bundled command-line interface. Each container would have its own ip-address and DNS name, which I would add to my Chef Provisioning cookbooks as a managed host to deploy my software and configs to the container over SSH. Chef Provisioning would run a wrapper-cookbook specific to each node that:

  1. Called out to a base-cookbook to set up users, ssh, and other things that were consistent across all my systems.
  2. Generally called out to some community cookbook to install and configure the service.
  3. Overrode variables to control the behavior of the above cookbooks, and did whatever small tweaks weren’t already covered.

I used Berkshelf to manage cookbook dependencies. It is a fantastic system modeled closely on Ruby’s bundler gem tool, and both tools have a powerful and flexible approach to dependency management.

The custom-cookbooks that I wrote had extensive testing to let me iterate on them quickly:

  • rubocop ran in milliseconds and ensured that any Ruby glue code in my system was syntactically valid and reasonably well styled.
  • Foodcritic similarly ran in milliseconds and ensured that my cookbooks were syntactically valid Chef code and reasonably well styled.
  • Chefspec unit tests ran in seconds and helped me iterate quickly to catch a large fraction of logic bugs.
  • test-kitchen and serverspec ran my cookbooks on real machines to provide slow feedback about the end-to-end behavior of my cookbooks in a real environment.
  • guard automatically ran the appropriate tests whenever I saved changes to a source file.

When everything was working, I was able to iterate my cookbooks quickly, catch most errors without having to wait for slow runs against real hardware, enjoy writing Chef/ruby code, and have a lot of confidence in my deploys when I pushed changes to my “real” systems. The problem was, everything almost never worked. Every time I upgraded anything in my Chef development environment, something broke that took hours to fix, usually multiple somethings:

  1. Upgrading gems inevitably resulted in an arcane error that required reading the source code of at least 2 gems to debug. I ended up maintaining personal patches for 6 different gems at one point or another.
  2. ChefSpec tests regularly failed when I upgraded Chef, gems, or community cookbooks. Again, the errors were often difficult to interpret and required reading upstream source-code to debug (though the fixes were always mechanically simple once I understood them). I really like the idea of ChefSpec providing fast feedback on logic errors, but on balance, I definitely spent more time debugging synthetic problems that had no real-world implication than I spent catching “real” problems with ChefSpec.
  3. Using LXC targets with test-kitchen was amazingly fast and memory efficient, but also amazingly brittle. The LXC plugin for test-kitchen didn’t reliably work, so I ended up using test-kitchen to drive vagrant to drive LXC. This setup was unreasonably complicated and frequently broke on upgrades. This pain was largely self-inflicted: test-kitchen can be simple and reliable when run with more popular backends. It was frustrating nonetheless.
  4. It’s idiomatic in Chef to store each cookbook in its own independent git repo (this makes sharing simpler at large scale). Gem versions, cookbook versions, and test configs are stored in various configuration files in the cookbook repository. This meant each upgrade cycle had to be performed separately for each cookbook, testing at each step to see what broke during the upgrade. Even when it went well, the boilerplate for this process was cumbersome, and it rarely went well.
  5. Chef Provisioning was another self-imposed pain point. Chef Provisioning over SSH has been reliable for me, but it’s overkill for my basic use case. When I started with it, it was very new and I thought I’d be learning an up-and-coming system that would later be useful at work. In fact, it never got a huge user base and I switched to a job that doesn’t involve Chef at all. It ended up being a bunch of complexity and boilerplate code for work that could have easily been accomplished with Chef’s built-in knife tool.

ChefDK can help with a lot of these problems, but I always found that I wanted to use something that wasn’t in it. That left me either maintaining two Ruby environments or hacking up the SDK install, so I tended to avoid ChefDK and manage my own environments, which probably caused me more pain than necessary in the long run. When I found something that didn’t work in the ChefDK, I probably should have just decided not to do things that way.

But regardless of whether you use the ChefDK or not, the cost of these problems amortizes well over a large team working on infrastructure automation problems all day long. One person periodically does the work of upgrading, they fix problems, lock versions in source control, and push changes to all your cookbooks. The whole team silently benefits from their work, and also benefits from the ability to iterate quickly on a well-tested library of Cookbooks. When I was working with Chef/Ruby professionally, the overhead of this setup felt tiny and things I learned were relevant to my work. Now that I’m not using Chef/Ruby at work, every problem is mine to solve and it feels like a massive time sink. The iteration speed never pays off because I’m only hacking Chef a few hours a month. It became hugely painful.

To Ansible

Although I migrated many of my services to Docker, I haven’t gone all-in. Minimally, I still need to configure Docker itself on my physical Docker host. And more generally, I’m not yet convinced that I’ll want to use Docker for everything I do. For these problems, I’ve decided to use Ansible.

In most ways, this is a worse-is-better transition.

  • Ansible Galaxy seems less mature than Berkshelf for managing role dependencies, and the best-practice workflow certainly seems less well-documented. (Do you check in downloaded roles, or download them on demand using requirements.yml? What’s the process for updating roles in each case?)
  • Standard practices around testing Ansible roles seem way less mature compared to what I’m used to in the Chef community, and seem mostly limited to running a handful of asserts in the role and running the role in a VM.
  • The yaml language feels less pleasant to read and write than Ruby to me, though practically they both work for my needs.
  • I won’t speak to Ansible’s extensibility as I haven’t attempted anything other than writing roles that use built-in resources.

Even though I feel like I’m accepting a downgrade in all the dimensions listed above, Ansible is good enough at each of those things that I don’t really miss the things I like better about Chef. And the amount of time I spend fixing or troubleshooting Ansible tooling can be effectively rounded to zero. This simplicity and reliability more than make up for the other tradeoffs.

I now have each of my handful of physical/cloud hosts defined in my Ansible inventory file and my Docker containers are defined in a role using docker_image and docker_container. Perhaps someday I’ll migrate to using Docker Compose but for now this is working well.

Testing is simple and basic: I have a Vagrantfile that launches one or two VirtualBox instances and runs whatever role(s) I’m currently hacking on against them (including the Docker roles if necessary). Testing a change takes a minute or two, but things mostly work the first time, and when there is a problem the fix is usually simple and obvious. Even though my feedback cycle is slower with Ansible, I find that iteration is faster because I’m working on the problem I’m trying to solve instead of yak-shaving six degrees of separation away.

Conclusion

I miss writing Chef cookbooks; it’s still my favorite configuration management system. The overhead of maintaining simple things in Chef is just too high, though, and the power of its ecosystem can’t offset that overhead on small projects. My life with Ansible and Docker feels messier, but it’s also simpler.

I’ve also come to appreciate that while having a really sophisticated configuration management and deployment system is great, it does you precious little good if your management of persistent state across upgrades and node-replacement isn’t similarly sophisticated. Building images with Dockerfiles feels like a huge step backward in terms of configuration management sophistication, but it’s a huge step forward in terms of state management, and that’s a tradeoff well worth making in many situations.

Goodbye Wordpress!

Introduction

After more than 12 years it’s time to say goodbye to WordPress. It’s been a good run and WordPress is fantastic software, but I spend considerably more time maintaining it than I do writing. A static site can do everything I want and needs way less maintenance when I’m not using it. I’ve switched over to Hugo and am relatively happy… though there were some minor bumps and bruises along the way.

Yearly and Monthly Archives

If you put the year or month in your permalink structure, Wordpress automatically creates yearly and monthly archive pages. For example, if your permalink structure is http://example.com/:year:/:month:/:slug:/, you can visit http://example.com/2017/06/ and see a list of postings from that month. I’m probably unusual, but I like to navigate sites this way and I want my site to have reasonable archive pages at those year and month urls.

Hugo can’t yet do this. It’s relatively straightforward to use Hugo’s templating features to create a single archive page that links to every post, but the per-year and per-month urls are important to me.

I wasn’t able to use Hugo to solve this problem, but most webservers do have the ability to automatically display the contents of a directory, which is already what Hugo generates. I configured my Caddy webserver to do this and it works ok. The generated page style is inconsistent with the rest of the site but Caddy does allow styling those pages if I choose to do so later. More likely I’ll live with the default style until the Hugo issue is resolved and then start generating monthly/yearly archives with Hugo.

Extensibility

Hugo is written in the Go programming language, which is a relatively young language that prominently features static and self-contained binaries. I’m a huge fan of Go’s static binaries. A large part of the reason I picked Hugo over Jekyll is ease of installation and upgrade (just download one binary and run it). But one downside of the self-contained nature of Go programs is that plugin systems are tricky to create. Hugo doesn’t have one yet. The lack of a plugin ecosystem does limit what Hugo can do compared to systems like Jekyll, but my needs are relatively simple and it hasn’t been a major issue.

Theme Inconsistency

The Hugo theme ecosystem seems immature compared to what I’m used to from the world of WordPress. WordPress has well-developed conventions for how themes are configured. In contrast, the Hugo theme ecosystem seems to have few broadly adopted conventions. Many Hugo themes don’t support every site layout that Hugo can generate, but instead assume that your site content adopts a specific category or filesystem layout. These limited/opinionated themes combined with my ignorance of Hugo’s site-layout conventions to create several confusing moments when the site rendered with missing content or in other unexpected ways. Only after reading Hugo’s site organization docs closely and poring through theme source code did I come to understand why things weren’t rendering as expected.

WordPress also has easy-to-use mechanisms for extending themes via plugins and widgets. With Hugo, themes themselves are the only mechanism for extending Hugo’s capabilities. Hugo does allow you to add and override things in your theme on a file-by-file basis without having to edit the upstream theme directly, which is relatively powerful. But there’s an art to factoring a theme into easily overridden files, and maintenance can be unpleasant if you end up having to override something in a large/core file in the theme. If you’re a front-end developer maintaining your own theme, none of this matters in the least. If you want to do light customization of an existing theme, minimizing maintenance headaches so you can update the upstream theme easily is a little finicky.

I chose hugo-theme-bootstrap4-blog for my theme and have been happy. It has clear documentation about the content layout it expects, it provides config.toml options for most things I want to customize, and the maintainer has been responsive to pull requests to add the features I wanted without my having to keep a fork that deviates from upstream.

Migration

Thankfully, migrating my data was not terribly difficult. I read this post on migrating data from WordPress to Hugo and was able to use a combination of WordPress’s built-in export-to-xml feature and the ExitWP tool to convert my WordPress database to a skeleton Hugo site. I was able to keep my permalink structure the same, and I was already hosting files and images at static non-WordPressy URLs that didn’t change. The only url change I found was that I had my RSS feed at /feed/ and I added a webserver redirect from there to /index.xml where Hugo puts the feed.

Conclusion

It wasn’t the smoothest migration in the world, and Hugo had a non-trivial learning curve for me… but I’m happy with the result. Writing posts is dead-easy and when I’m not writing there’s no maintenance to do.

Filling up the Boot Partition

Ubuntu doesn’t remove old kernels when upgrading to new kernel versions, which is great because sometimes there’s a compatibility problem and you want to roll back. If you don’t pay attention to free disk space, though, it’s really easy to fill up your boot partition, which is only a couple hundred megs by default. When this happens, kernel upgrades start failing and apt may start throwing errors for all package operations, which isn’t fun. It’s relatively straightforward to recover from, but it happens infrequently enough that it always takes me too long to remember the details. Next time I’ll check here:

  1. Find the currently running kernel version (I never uninstall the currently running kernel in case there are compatibility issues with newer kernels):

     uname -r
    
  2. For each older kernel you want to remove:

     sudo dpkg --purge linux-image-x.x.x-xx-generic \
                       linux-image-extra-x.x.x-xx-generic
    

    You can do this with apt-get or aptitude as well, but dpkg is less likely to run into problems with broken dependencies or half-configured kernel installs (as is common if you discover this problem while upgrading your kernel).

  3. I almost always also have kernel headers installed. While they don’t take up space in /boot, they’re not needed once the old kernel is removed either. Might as well clean them up as well:

     sudo dpkg --list | grep linux-headers
     sudo dpkg --purge linux-headers-x.x.x-xx \
                       linux-headers-x.x.x-xx-generic
    
  4. At this point apt-get can probably finish whatever installs/upgrades were in-flight when this issue started:

     sudo apt-get --fix-broken install
    

Now to set up free disk space monitoring so this doesn’t happen every few months.
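
One minimal way to do that monitoring, sketched in Ruby under a few assumptions: a POSIX df whose -P output puts the capacity in the fifth column, and a cron job or similar to run the script. The use_percent helper and the 80% THRESHOLD are made up for this example, not taken from any particular tool.

```ruby
# Parse `df -P <mount>` output and return the use percentage as an Integer.
# POSIX `df -P` prints a header line followed by one line per filesystem,
# with the capacity (e.g. "87%") in the fifth column.
def use_percent(df_output)
  fields = df_output.lines.last.split
  Integer(fields[4].delete('%'))
end

# Hypothetical threshold; pick whatever leaves room for one more kernel.
THRESHOLD = 80

if $PROGRAM_NAME == __FILE__
  output = `df -P /boot 2>/dev/null`
  if $?.success?
    percent = use_percent(output)
    # Anything written to stderr from cron typically ends up in a mail.
    warn "/boot is #{percent}% full" if percent >= THRESHOLD
  else
    warn 'could not run df on /boot'
  end
end
```

Keeping the parsing in its own function means it can be checked against a captured df line without touching the real disk.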

Chefspec 3 and Guard Evaluation

Introduction

Chefspec is an incredibly useful tool for testing Chef cookbooks. It’s much much faster than running chef on a real node, but it can provide you much of the testing feedback you’d get from a real chef run. Verious.com has a nice introduction to chefspec if you’re not already familiar with it.

What makes chefspec so fast is that it doesn’t perform a full chef run. It loads Chef with your cookbooks and modifies them in-memory so that they merely send messages to Chefspec instead of performing real system changes. It does this primarily by stubbing Chef’s Resource class. In Chef, just about every way to manipulate system state is a resource. Most of them have excellent metadata about the actions they will perform (files have filename attributes, packages have package names) and they all share common methods for executing the work, so it’s surprisingly straightforward for Chefspec to stub the “doing work” part so it performs no action, while retaining the ability to effectively test for what would have been done.
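
To make the idea concrete, here is a toy model of that approach in plain Ruby. It is not ChefSpec’s actual implementation: ToyResource, ToyFile, and RecordingStub are hypothetical names invented for this illustration. It only shows why a shared “doing work” method makes the stubbing easy.

```ruby
# Toy model only: none of these classes exist in Chef or ChefSpec.
class ToyResource
  attr_reader :name, :action

  def initialize(name, action)
    @name = name
    @action = action
  end

  # The shared "doing work" entry point that every resource goes through.
  def run_action
    perform
  end
end

class ToyFile < ToyResource
  def perform
    File.write(name, '') # a real side effect
  end
end

# Prepending this module replaces the shared entry point, so no resource
# touches the system; each one only records what it would have done.
module RecordingStub
  def self.log
    @log ||= []
  end

  def run_action
    RecordingStub.log << [self.class.name, name, action]
    nil
  end
end

ToyResource.prepend(RecordingStub)

ToyFile.new('/tmp/toy-example.conf', :create).run_action
p RecordingStub.log
```

Because every resource funnels through one method and carries its own metadata, a single override is enough to turn a “run” into a list of intended actions that a test can assert against.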

Execute Blocks

This process is nothing short of amazing for Chef built-in resources like files, templates, packages, etc. It’s fast, it’s accurate (excepting bugs in core Chef that result in unexpected actions due to the “doing work” code), and it’s simple to use. But it does have limits. A good chunk of Chef’s flexibility comes from the ability to run custom code in execute blocks and ruby blocks:

execute "thing1" do
  command "rm -rf /"
end
execute "thing2" do
  command "find / -print0 | perl -0ne unlink"
end
execute "thing3" do
  command "python -c \"import shutil; shutil.rmtree('/')\""
end

Chefspec is pretty limited in what it can tell you about execute blocks. There’s no way it can know that the 3 execute blocks above all do the same thing (delete the root filesystem), or that it’s not safe to run those commands on your development workstation. Out of the box, it’s largely limited to testing whether or not the execute block is called.

Guards

But even reliably determining if an execute block will run is not trivial. The not_if and only_if guards used to determine whether the block runs present similar problems to the execute block itself:

execute "create_database_schema" do
  command "mysql -u user -p password dbname < create_schema.sql"
  not_if "echo 'show tables;' | mysql -u user -p password dbname | grep tablename"
end

The not_if guard above will give unexpected results if the mysql binary is missing from the system where you run chefspec. Chefspec 2.x sidestepped the issue. It didn’t execute guards by default, and simply assumed that the execute block itself would always run… not ideal. Chefspec 3 does inspect the guards, but rather than executing the commands inside of them, it raises an error requiring you to stub it yourself like so:

it "Creates the Database Schema when needed" do
  stub_command("echo 'show tables;' | mysql -u user -p password dbname | grep tablename").and_return(false)
  expect(chef_run).to run_execute('create_database_schema')
end
it "Doesn't create the Database Schema when it already exists" do
  stub_command("echo 'show tables;' | mysql -u user -p password dbname | grep tablename").and_return(true)
  expect(chef_run).to_not run_execute('create_database_schema')
end

This is a pretty clean example. In practice, guards frequently contain wacky stuff. It’s not unusual to leverage a couple shell commands and do some ruby transformations on the resulting complex data type, possibly requiring several stub classes to stub a single guard. If you include several upstream cookbooks, you may have a substantial amount of stubbing ahead of you before chefspec 3 will run cleanly.
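
The “stub it or get an error” behavior can be modelled in a few lines of plain Ruby. This is a rough sketch of the idea, not ChefSpec’s code: CommandStubs, UnstubbedCommandError, and runs_with_not_if? are hypothetical names used only here.

```ruby
# Toy model only: these names are hypothetical, not part of ChefSpec.
class UnstubbedCommandError < StandardError; end

class CommandStubs
  def initialize
    @results = {}
  end

  # Register the result a shell guard should pretend to have:
  # true means the command "exited 0", false means it failed.
  def stub(command, succeeds)
    @results[command] = succeeds
  end

  # Look the command up instead of running it; refuse to guess.
  def succeeds?(command)
    @results.fetch(command) do
      raise UnstubbedCommandError, "no stub registered for: #{command}"
    end
  end
end

# A not_if guard lets the resource run only when its command fails.
def runs_with_not_if?(stubs, guard_command)
  !stubs.succeeds?(guard_command)
end

stubs = CommandStubs.new
guard = "echo 'show tables;' | mysql -u user -p password dbname | grep tablename"

stubs.stub(guard, false)
puts runs_with_not_if?(stubs, guard)
```

The lookup is keyed on the exact command string, which is why every distinct guard in every included cookbook needs its own stub before the suite runs cleanly.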

Test Coverage

The Chefspec 3 convention of encouraging the stubbing of not_if and only_if guards results in covering more of your Chef code with unit tests, and that’s a great thing. It comes with a non-trivial cost, though. Having to stub the code in included cookbooks in order to test your own code isn’t fun. With chefspec 2.x, I accepted a very low level of code coverage from chefspec, using it only to test “well-behaved” resources that required little to no stubbing. My complete testing workflow looks like this:

  • Syntax and style testing with Rubocop.
  • Chef syntax checking with knife cookbook test
  • Fast low-coverage unit tests with chefspec
  • Slow, high-coverage integration tests with minitest-handler (either via Vagrant provision while I’m hacking or test-kitchen in Jenkins/Travis)

Because the integration environment that Chef works in is so much more complex than that of most (non-infrastructure-automation) code, I prefer to invest in having a strong integration test suite in minitest-handler rather than spending a lot of time stubbing and mocking in chefspec. I still want to use Chefspec to catch low-hanging fruit because my integration tests are very slow by comparison, but I’m willing to accept a relatively low level of unit-test coverage. If I was doing lots of LWRP development or otherwise had to test very complex logic in Chef, I’d need stronger unit testing, but 90% of my Chef code is straightforward attributes, resources, and includes, so integration testing is where a lot of my bugs are easiest to find.

Skipping Guard Evaluation

Which is a roundabout way of saying, I like the chefspec 2.x behavior of skipping guard evaluation. It results in a less robust unit-test suite, but I make up for it with my integration tests. If you prefer the same tradeoff, you can get the old behavior back by stubbing Chef’s resource class yourself:

require 'chefspec'
describe 'example_recipe' do
  let (:chef_run) { ChefSpec::Runner.new(platform:'ubuntu', version:'12.04').converge 'example_cookbook::example_recipe' }
  before(:each) do
    # Stub out guards so that execute blocks always "run"
    Chef::Resource::Execute.any_instance.stub(:should_skip?).and_return(false)
  end
  it 'Creates the database schema' do
    expect(chef_run).to run_execute('create_database_schema')
  end
end