🔥Let’s Do DevOps: GitHub Runner in Docker, Always Correct Version

Sep 01, 2021

This blog series focuses on presenting complex DevOps projects as simple and approachable via plain language and lots of pictures. You can do it!

Hey all!

I’ve written extensively about how we’ve migrated from static VM hosts to scale-sets, then to docker for our Azure DevOps builders. We are also preparing GitHub as a platform for our internal teams, and one challenge has been the builders, or as GitHub calls them, Self-Hosted Runners. These are VMs or containers which we host in our internal network, and register to our GitHub in order to receive and run compute jobs (defined as Actions files).

We have several goals for these Runners:

Container-driven: container technologies make the following goals much easier to reach
Single use, then destroy: this prevents local artifacts from leaking between jobs
Frequently and automatically patched: images are rebuilt often, so patches roll out without manual work
Infra as Code: all configuration and changes are tracked as code, which makes failures easier to trace

We did all that, and I have a video post here where I talk about how we did it and why we chose these goals.

But I want to talk about a specific issue we saw and how we tamed it.

Docker Process

Docker has two parts:

Build: This process is defined by a Dockerfile, a list of instructions that are executed in order to generate an image. That image is stored in a container registry, and the run step pulls it by tag to start the container.
Run: The image is downloaded from the container registry, then an “entrypoint” script is run. The execution tool monitors the entrypoint script, and once it exits, the tool notices and can take an action, such as spinning up a replacement host.

For our builders, the build process installed a specific version of the Runner software. It looks like this:

	# GitHub Runner Version
	# Pinning a specific version causes issues; need to update to auto/latest
	ARG RUNNER_VERSION="2.280.3"
	# GitHub Runner Installation
	RUN cd /home/githubuser && mkdir actions-runner && cd actions-runner \
	&& curl -O -L https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz \
	&& tar xzf ./actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz
	# install some additional dependencies
	RUN /home/githubuser/actions-runner/bin/installdependencies.sh

Note that we’re installing a specific version of the GitHub Runner software. We’ve done this with our Azure DevOps (ADO) builders for several months, and it works great. If there’s a new bug fix we want in the runner software for ADO, we update the Dockerfile, build the image, and push it to our registry.

However, the GitHub Runners don’t work that way. They absolutely won’t permit a non-current version of their Runner software to register. They automatically and immediately trigger an update to the newest version.

On a VM, that’s great. The process would download the new version, restart to install it, then start it again. It only takes a minute, and you’re back up and running.

However, in a container, that restart means the “entrypoint” process launched by the execution software fully stops once the download completes, while the Runner restarts to update. The container execution tool immediately notices the process it launched has stopped and interprets this as a terrible failure. Its whole job is to keep that process running! So it immediately terminates the instance and launches a new one.

The new instance, of course, does the same thing. It tries to register, sees the new version of the Runner, then restarts, and the execution tool does the same thing.

In effect, we’ve created a complicated boot-loop, which isn’t great.
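
To make the loop concrete, here is a minimal bash simulation. It is only a sketch: `supervise` is a hypothetical stand-in for the container execution tool and `fake_entrypoint` is a hypothetical stand-in for a Runner that exits to update itself. Neither name comes from Docker, ECS, or the Runner software, and the launch cap exists only so the demo terminates.

```shell
#!/usr/bin/env bash
# Simulation only: no containers or Runner software are involved.

# Hypothetical stand-in for an entrypoint whose Runner is out of date:
# it "downloads" the update, then exits so the process can restart.
fake_entrypoint() {
  echo "runner: newer version found, exiting to self-update"
  return 0
}

# Hypothetical stand-in for the execution tool: launch the entrypoint,
# and when it exits, replace the instance. Capped so the demo ends.
supervise() {
  local max_launches="$1"
  local launches=0
  while [ "$launches" -lt "$max_launches" ]; do
    launches=$((launches + 1))
    echo "supervisor: launching instance $launches"
    fake_entrypoint
    echo "supervisor: entrypoint exited, replacing instance"
  done
  echo "supervisor: gave up after $launches launches"
}

supervise 3
```

Every replacement instance starts from the same image, so each one sees the same stale Runner and exits the same way. Without the cap, nothing in this loop would ever break the cycle.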

Our Initial (FireDrill) Fix

When we initially ran into this issue, we’d see that everything was running fine. Jobs were running, the containers would die after each job, and a new Runner would spin up and register. Exactly what we want.

Then a new version of Runner would be released, which happens frequently, roughly once a week. And suddenly our Runners wouldn’t be able to register! Oh no!

We’d open a PR against our Dockerfiles to update the version of Runner we install, then get it merged and built (which can take some time!), then deployed. The next time the execution tool pulled the image, it had the new version and was able to register.

That’s a relatively quick fix, but it leaves our whole Runner infrastructure brittle: it falls over without warning whenever the vendor releases a new version. Not a very stable architecture for our internal clients.
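
The version bump in that PR is a one-line change to the ARG line. As a rough sketch of what that edit amounts to, here is a hypothetical helper, `bump_runner_version`, that rewrites the pinned value. The helper name and the throwaway demo file are assumptions for illustration, not part of our repo or tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the pinned RUNNER_VERSION in a Dockerfile.
bump_runner_version() {
  local dockerfile="$1"
  local new_version="$2"
  local tmp
  tmp="$(mktemp)"
  # Replace only the quoted value on the ARG line; keep the rest of the line.
  # Writing to a temp file and moving it avoids non-portable "sed -i" flags.
  sed -E "s/^([[:space:]]*ARG RUNNER_VERSION=\")[^\"]*(\".*)$/\1${new_version}\2/" \
    "$dockerfile" > "$tmp" && mv "$tmp" "$dockerfile"
}

# Demo against a throwaway file, not a real Dockerfile.
demo_file="$(mktemp)"
printf '%s\n' '# GitHub Runner Version' 'ARG RUNNER_VERSION="2.280.3"' > "$demo_file"
bump_runner_version "$demo_file" "2.281.0"
cat "$demo_file"
```

Even with the edit scripted, the merge, build, and deploy steps still sit between a new release and working Runners, which is the real cost of the fire drill.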

GitHub’s Official Solution

GitHub’s customers noticed this long before we did, and GitHub built a sort of hacky solution. They created a service wrapper that wraps the Runner service, and permits updates. Which is great, problem solved, right? Well, no. It does permit upgrades in containers without a restart but doesn’t yet support the “run-once” behavior of Runners.

See runsvc.sh in the actions/runner repository (master branch) on GitHub.

We require that functionality, so that’s an immediate no-go for our business. We’re in a regulated industry (healthcare), so keeping each job’s local artifacts isolated from other jobs is a big deal.

What else can we do?

Our Solution #1: Entrypoint Install!

My good friend and fellow architect at work, Sai Gunaranjan, wrote a solution for this. His idea is that as part of the launch process, we can check the latest version of the Runner, download, and install it.

We retain our ability to --run-once, and we always have the latest version of the Runner installed. Even if a new version of the Runner software is released mid-day, each time the container is launched, it will still download and install the newest version of the software. It looks like this in our entrypoint bash script:

	RUNNER_VERSION="${RUNNER_VERSION:-$(curl --silent "https://api.github.com/repos/actions/runner/releases/latest" | grep tag_name | sed -E 's/.*"v([^"]+)".*/\1/')}"
	curl -O -L https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz
	tar xzf ./actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz
	./bin/installdependencies.sh

This is particularly cool. We do a curl to the GitHub API to grab the release info for the public Runner repo and filter for the very latest version. Then we use that to download the latest version and install it.
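
The filtering step can be tried offline. The sketch below wraps the grep and sed stages in a function, `extract_runner_version` (a name made up for this example), and feeds it a trimmed, made-up response shaped like the API output rather than a real one. The sed expression uses `.*` on both sides of the quoted tag so that only the version number survives.

```shell
#!/usr/bin/env bash
# Extract the version from a releases/latest style JSON document on stdin.
# Relies on the response being pretty-printed, one field per line.
extract_runner_version() {
  grep '"tag_name"' | sed -E 's/.*"v([^"]+)".*/\1/'
}

# Trimmed, made-up sample shaped like the API response (not real API output).
sample_response='{
  "url": "https://api.github.com/repos/actions/runner/releases/1",
  "tag_name": "v2.280.3",
  "name": "v2.280.3"
}'

printf '%s\n' "$sample_response" | extract_runner_version
```

Against the sample, this prints 2.280.3. In the entrypoint, the same function would sit on the receiving end of the curl call instead of printf.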

This is great… except. It takes a significant amount of time to do this, and this is on every launch. On my local machine, it’s about 50–60 seconds. On our ECS execution on AWS, it takes 2–3 minutes. That’s a long time of waiting on a build. It’s just too long. We need a better way.

Our Solution #2: Docker Build Install, Entrypoint Check

I decided to combine both solutions together:

On Docker build, we install the very latest GitHub Runner software. Remember, builds are frequent: at least once per day
On run, our entrypoint script checks whether the installed version matches the latest version
If we are running the latest version (almost always), we register happily and start running jobs :) This is our fast, happy path.
If we aren’t running the latest version (new Runner software released since build), we download and install the newest version. This is a slower path, but it’s infrequent and only runs when needed.

First, we update our Dockerfile so that builds always install the very latest version of Runner software:

	# GitHub Runner Installation - Install Latest
	RUN mkdir /actions-runner && cd /actions-runner \
	&& RUNNER_VERSION=$(curl --silent "https://api.github.com/repos/actions/runner/releases/latest" | grep tag_name | sed -E 's/.*"v([^"]+)".*/\1/') \
	&& curl -LO https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz \
	&& tar xzf ./actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz \
	# Install GitHub Runner dependencies
	&& /actions-runner/bin/installdependencies.sh

This means that as of the build time, this is the latest version of Runner software. However, sometimes a new version is released the next day before we build again. We need to check for that in our entrypoint script and install it if a new version is released. If not, though, we need to quickly register, to avoid the long delay (several minutes) of installing on each boot.

The simple entrypoint bash script looks like this. First, we find the installed version using a local command, then we check the latest released software using curl, and store both in variables.

	# Find installed version and latest version
	INSTALLED_VERSION=$(/actions-runner/config.sh --version)
	LATEST_VERSION=$(curl --silent "https://api.github.com/repos/actions/runner/releases/latest" | grep tag_name | sed -E 's/.*"v([^"]+)".*/\1/')

Then we use a simple bash script to compare the variable values. If they match, print a happy message and move right along to registration. If they don’t, we download the latest version and install it.

	if [ "$INSTALLED_VERSION" = "$LATEST_VERSION" ]; then
	  echo "********"
	  echo "GitHub Runner version installed is already latest"
	  echo "********"
	else
	  echo "********"
	  echo "GitHub Runner version out of date, updating to newest"
	  echo "********"

	  # Download and install latest version of runner software
	  curl -O -L https://github.com/actions/runner/releases/download/v${LATEST_VERSION}/actions-runner-linux-x64-${LATEST_VERSION}.tar.gz
	  tar xzf ./actions-runner-linux-x64-${LATEST_VERSION}.tar.gz

	  # Install dependencies
	  ./bin/installdependencies.sh
	fi
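
One gap in the comparison above is what happens when the version lookup itself fails, for example on a network error or when the API rate-limits an unauthenticated request. The latest version then comes back empty, never matches the installed one, and the script tries to download from a URL with no version in it. The sketch below is one possible guard, not what we run today: `decide_runner_action` and its three output labels are names made up for this example.

```shell
#!/usr/bin/env bash
# Decide what the entrypoint should do, given the two version strings.
# Prints one of: up-to-date, update, skip-check
decide_runner_action() {
  local installed="$1"
  local latest="$2"
  if [ -z "$latest" ]; then
    # Lookup failed (network error, API rate limit, unexpected response).
    # Assumption: falling back to the installed version beats downloading
    # from a URL built out of an empty version string.
    echo "skip-check"
  elif [ "$installed" = "$latest" ]; then
    echo "up-to-date"
  else
    echo "update"
  fi
}

decide_runner_action "2.280.3" "2.280.3"
decide_runner_action "2.280.3" "2.281.0"
decide_runner_action "2.280.3" ""
```

The three calls print up-to-date, update, and skip-check in that order. The trade-off with skip-check is that if the installed version really is stale, registration will still trigger the self-update and the container restart, so retrying the lookup a few times before giving up may be the better choice.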

And that’s it, we get the best of both worlds:

Quick boots when our Runner software is up to date
Upgrades to the latest version only when required

Summary

And that’s how we’ve solved the problem for now. We’re looking forward to an “ephemeral” mode for Runners, which will be the process wrapper for docker that permits Runner software to update itself without causing the container to restart.

But until then, this works pretty well! And it requires no more fire drills where our whole Runner infra suddenly falls over. Back to martinis.

Thanks all! Good luck out there.
kyler
