🔥Let’s Do DevOps: Share ECR Docker Image and Secrets Between AWS Accounts
This blog series focuses on presenting complex DevOps projects as simple and approachable via plain language and lots of pictures. You can do it!
Hey all!
As we’ve scaled out our CI/CD (~175 pipelines across 75 accounts) to have many builders (~25 pools, ~50 or so builders), we’ve hit pain points where the way we were doing it before just wasn’t cutting it. I’ve written extensively about how graphical, manually-managed pipelines just couldn’t scale beyond about 50 — I’d spend literally all day, every day, just updating Terraform versions and double-checking storage accounts, and no one wants that job.
We eventually wrote out our pipelines in YAML, and now manage them via pull requests in a git repo. That permits variables (to conclusively match storage accounts and account IDs among steps in a pipeline), and also mass updates — find/replace for Terraform versions. That has been a roaring success, and we’ve adopted that model all over.
This blog focuses on our builders — as we’ve scaled out our AWS accounts, we’ve been registering a single small Amazon Linux host to be a runner in each account. That is a boon to authentication security — it can use an assumed IAM role within the account to manage it, rather than authenticating with a static IAM user, but a pain for management and inter-pipeline security. Let’s talk about each separately.
First, management. These are long-lived EC2 hosts that need to be monitored, rebooted occasionally, and patched. That overhead is an anti-pattern for the containerized, serverless models of modern applications. So it’s gotta go.
Second, inter-pipeline security. Pipelines in CI/CD often have access to privileged information — they connect to secrets stores like SSM or Vault, sometimes copying those secrets to the local disk when doing compute or ETL operations. Now imagine those artifacts aren’t cleaned up, and a second, malicious pipeline runs on the same builder and uploads every file it can access to somewhere else. That’s not great. There are mitigations you can build in, like each pipeline cleaning up its artifacts, or making sure no team running jobs stores secrets on disk, but that’s very much a herding-cats model. It’s gotta go.
New Model Goals
I spent a lot of time mapping out a new model with another architect, Sai Gunaranjan, and figuring out what our goals are. Here’s what we need to satisfy:
Container or serverless driven, no servers! Containers have myriad benefits, including easy relaunch on issues, and can be frequently rebuilt to include security patches
Runners should run a single job only, and then die, so no risk of leaking secrets between jobs
Centralized repository and secret access, so there isn’t a continued need to update secrets and docker image in many locations
Easy to deploy, with a robust, standardized Terraform module
New Model: Centralized ECR and Secrets, ECS Runners
We have to solve this in both the Azure and AWS space for business reasons. I know the AWS side better, so I own that. I elected to create a “Hub” account that stores all the non-duplicated items (ECR, secret), and have many “spoke” accounts that will store their own ECS service, task definition, and scheduling.
In the single “Hub” account, I created several resources:
An ECR to store the builder container image that spoke accounts will pull, plus an ECR policy to permit inter-account access
A Secrets Manager secret that I populated with the PAT token (used to authenticate to Azure DevOps (ADO) and register as a builder), plus a secret policy permitting inter-account access
A KMS CMK, a customer-created and managed encryption key, which is used to encrypt the secret, plus a KMS policy which permits “spoke” accounts to access this CMK so they can decrypt the Secrets Manager secret
In each runner “Spoke” account we call a Terraform module that creates the following (a sketch of the module call follows this list):
Cloudwatch log group — Most AWS services optionally use logs, and need somewhere to store them
An IAM “execution” role (and policies) used to pull the ECR image and access secrets
An ECS Fargate cluster — Contains almost no config, but defines underlying hardware type, here “FARGATE”
An ECS task definition that defines how large the runner and underlying Fargate host should be, as well as container location and environment (and secret) information
An ECS service to run n copies of the above ECS task and handle replacing them after they die
An autoscaling target to link autoscaling policies with the ECS service
Autoscaling schedules to spin up more runners during business hours when they are most active and fewer after-hours, and also to zero out the runner pool each night to trigger mandatory container redeployments each morning — I’ll talk about the logic here more later in this blog
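Tying the Spoke side together, here is a minimal sketch of what that module call might look like. The module source path, variable names, and placeholder values are assumptions for illustration; they mirror the variables referenced in the Spoke resources later in this post.

module "ado_ecs_builders" {
  source = "./modules/ado_ecs_builders" # Hypothetical local path to the Spoke module

  # Hub account references (placeholders, substitute your Hub's real values)
  image_ecr_url             = "aaaaaaa.dkr.ecr.us-east-1.amazonaws.com/hub_ecr_repository"
  image_tag                 = "prod"
  ado_join_secret_token_arn = "arn:aws:secretsmanager:us-east-1:aaaaaaaaaaa:secret:AzureDevOps_JoinBuildPool_PAK" # Placeholder, use the full secret ARN from the Hub account

  # Spoke-local networking (placeholders)
  service_subnets = ["subnet-11111111", "subnet-22222222"]
  service_sg      = ["sg-33333333"]

  # Scaling behavior
  enable_scaling                    = true
  autoscale_task_weekday_scale_up   = 4 # Assumption, pick what your workload needs
  autoscale_task_weekday_scale_down = 1
}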
AWS Cross-Account IAM Access
AWS has a much-maligned, and very finicky security model. It’s incredibly powerful, but also so complex it’s sometimes painful to work with.
The model basically works from both sides of access:
A target resource that is receiving a request must have an IAM policy permitting it to receive that request from the sending resource
A resource sending the request must have a policy that permits it to send the request to the target
Keep that in mind as I show the specific config below, and we’ll talk about both sides.
There’s also an interesting catch-22 here — whenever you update an IAM policy, it checks to make sure the “principal” (resource) exists. This means that you can’t proactively create the hub policy before creating the spoke IAM resources, because the hub IAM policies will say the spoke IAM roles don’t exist, so the principal is invalid. This means we need to follow an interesting deployment strategy:
First, deploy the Hub resources without policies. These need to exist so spoke IAM can be created
Second, deploy the IAM execution roles in each Spoke account, with policies to permit access to the Hub resources
Third, deploy policies in the Hub account to permit spoke access
Fourth, deploy the ECS service and task in the Spoke accounts, which should now spin up and succeed
This is a huge bummer in terms of parallel deployment — if each account is managed by its own independent automation, like ours is, you’ll have issues. If you’re managing your multiple accounts with Terraform workspaces (or CloudFormation?), you can stage the deployment and manually do each of the four steps in the proper order.
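One way to stage steps one and three without fully separate Hub configs is to gate the Hub’s cross-account policies behind a flag: deploy the Hub with the flag off, deploy the Spoke IAM roles, then flip the flag and apply the Hub again. A minimal sketch, assuming a hypothetical enable_cross_account_policies variable (the policy body is the same ECR policy shown later in this post):

variable "enable_cross_account_policies" {
  description = "Flip to true once the Spoke execution roles exist"
  type        = bool
  default     = false
}

# Same ECR policy shown later in this post, just gated behind the flag
resource "aws_ecr_repository_policy" "hub_ecr_repository_policy" {
  count      = var.enable_cross_account_policies ? 1 : 0
  repository = aws_ecr_repository.hub_ecr_repository.name
  policy = jsonencode({
    "Version" : "2008-10-17",
    "Statement" : [{
      "Sid" : "AllowSpokeAccountsToPull",
      "Effect" : "Allow",
      "Principal" : { "AWS" : ["arn:aws:iam::1234567890:root"] },
      "Action" : [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ]
    }]
  })
}

The same count trick works for the Secrets Manager secret policy; the KMS key policy lives inline on the key itself, so you’d conditionally build its Spoke statement instead (or accept a two-pass apply on the key).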
I’ll walk through the resources and policies together, but keep in mind the deployment steps above for when you build it yourself.
Azure DevOps Builder Image
I’ll write another blog about how we designed, built, and tested the Docker image. For now, assume the Docker image works well, and accepts the following three environment variables to configure itself:
AZP_URL — The location of your Azure DevOps Org, e.g. https://dev.azure.com/foobar
AZP_POOL — The name of the existing pool on Azure DevOps to register into. NOTE: Make sure this pool exists before attempting to programmatically register to it — hosts can’t create a pool when they register.
AZP_TOKEN — a secret token permitting hosts to register to builder pools. This token is created under a specific administrator user (or service principal) and expires periodically. More info here.
Hub Account Resources
ECR and Policy
First, we need somewhere to store our Docker image. AWS permits only a single image (under any number of tags) to be stored in a single ECR — I like this model, compared with Azure’s “many images in a single ACR.” ECRs have a name, and not much else. I recommend enabling “scan_on_push” to get AWS’s built-in image security scanning and reporting.
Mutability is a topic of much discussion among the docker community. Flexibility-heavy orgs leave MUTABLE on, and this permits over-writing tags with a new version that should be used. Structured, or security-heavy orgs, turn off mutability and often use image tags to track versions, and once an image tag is registered it can’t be over-written. Totally up to you and your org!
resource "aws_ecr_repository" "hub_ecr_repository" { | |
name = "hub_ecr_repository" | |
image_tag_mutability = "MUTABLE" | |
image_scanning_configuration { | |
scan_on_push = true | |
} | |
tags = { | |
name = "hub_ecr_repository" | |
terraform = "true" | |
} | |
} |
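Not required for this build, but if you leave mutability on and push frequently, untagged image layers will pile up in the ECR. An optional lifecycle policy can expire them; a sketch, where the 14-day window is an arbitrary assumption:

resource "aws_ecr_lifecycle_policy" "hub_ecr_repository_lifecycle" {
  repository = aws_ecr_repository.hub_ecr_repository.name
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Expire untagged images after 14 days"
      selection = {
        tagStatus   = "untagged"
        countType   = "sinceImagePushed"
        countUnit   = "days"
        countNumber = 14 # Assumption, tune to how often you rebuild
      }
      action = { type = "expire" }
    }]
  })
}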
We also need an ECR policy that permits the spoke accounts to get to this image. You can either grant access to arn:aws:iam::1234567890:root, which permits any resource in that account to get to it, or you can grant access to something like arn:aws:iam::1234567890:role/FooBarRole, which grants access only to that specific IAM role. I don’t think the image source is very sensitive, so I use the :root method here, but the secret and CMK policies next are very sensitive, so we are more specific there.
resource "aws_ecr_repository_policy" "hub_ecr_repository_policy" { | |
repository = aws_ecr_repository.hub_ecr_repository.name | |
policy = jsonencode( | |
{ | |
"Version" : "2008-10-17", | |
"Statement" : [ | |
{ | |
"Sid" : "AllowSpokeAccountsToPull", | |
"Effect" : "Allow", | |
"Principal" : { | |
"AWS" : [ | |
"arn:aws:iam::1234567890:root" | |
] | |
}, | |
"Action" : [ | |
"ecr:GetDownloadUrlForLayer", | |
"ecr:BatchGetImage", | |
"ecr:BatchCheckLayerAvailability" | |
] | |
} | |
] | |
} | |
) | |
} |
KMS CMK and Policy
Next we want to create a secret and permit sharing it between accounts. However, secrets are encrypted with a key, so nothing without access to that key can decrypt them. This is a great security model, but it means we need to share the encryption key between accounts before we share the secret, or the spoke accounts will be able to reach the secret but not decrypt it. So first, let’s talk about the KMS CMK.
KMS keys are created as a distinct resource, and their policies even permit locking the hosting account out of the key entirely, which isn’t ideal. So first we grant access to arn:aws:iam::aaaaaaaa:root, where aaaaaaaa is the account ID of the Hub account. This permits the console and APIs to manage the key, something we expect and want.
The second stanza permits a specific IAM role in a spoke account to get to this resource. Remember, the IAM role must already exist in the spoke account, or this policy will be rejected by the AWS API.
resource "aws_kms_key" "hub_secrets_manager_cmk" { | |
description = "KMS CMK for Secrets Manager" | |
policy = jsonencode( | |
{ | |
"Version" : "2012-10-17", | |
"Id" : "auto-secretsmanager-2", | |
"Statement" : [ | |
{ | |
"Sid" : "Enable IAM User Permissions", | |
"Effect" : "Allow", | |
"Principal" : { | |
"AWS" : "arn:aws:iam::aaaaaaaaaaa:root" #Root account ARN (remember to remove these comments before deploying, json doesn't like comments) | |
}, | |
"Action" : "kms:*", | |
"Resource" : "*" | |
}, | |
{ | |
"Sid" : "SpokeBuilderAccess", | |
"Effect" : "Allow", | |
"Action" : [ | |
"kms:Decrypt", | |
"kms:DescribeKey" | |
], | |
"Resource" : "*", | |
"Principal" : { | |
"AWS" : [ | |
"arn:aws:iam::bbbbbbbbbb:role/SpokeABuilderExecutionRole" | |
] | |
} | |
} | |
] | |
} | |
) | |
tags = { | |
Terraform = "true" | |
} | |
} |
Next, we create a KMS alias. This isn’t strictly required, but it helps the humans keep track of keys by giving them aliases (read: names).
resource "aws_kms_alias" "hub_secrets_manager_cmk_alias" { | |
name = "alias/hub_secrets_manager_cmk" | |
target_key_id = aws_kms_key.hub_secrets_manager_cmk.key_id | |
} |
Secrets Manager Secret — PAK
Now that the KMS key is shared with the spoke accounts, they can decrypt stuff that is encrypted with it, like the Hub’s secret! Let’s create that secret. Note that we need to call out the KMS key used to encrypt this secret, or it’ll be encrypted with a different (default) KMS key that hasn’t been shared.
Also note that the actual PAK secret value isn’t provided in Terraform at all. Once this secret is created, you’d log in to the AWS console by hand and populate it.
resource "aws_secretsmanager_secret" "hub_ado_join_pak" { | |
name = "AzureDevOps_JoinBuildPool_PAK" | |
kms_key_id = aws_kms_key.hub_secrets_manager_cmk.arn | |
tags = { | |
Terraform = "true" | |
} | |
} |
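Populating the value by hand in the console (steps below) keeps the PAK out of Terraform state. If you’d rather seed it from Terraform anyway, here’s a sketch, assuming a hypothetical ado_pak_value variable passed in at apply time (just be aware the value will land in your state file):

variable "ado_pak_value" {
  description = "PAK/PAT used to register builders, passed in at apply time"
  type        = string
  sensitive   = true
}

# Writes the value as the current version of the secret created above
resource "aws_secretsmanager_secret_version" "hub_ado_join_pak_value" {
  secret_id     = aws_secretsmanager_secret.hub_ado_join_pak.id
  secret_string = var.ado_pak_value
}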
Open the AWS console, and find the secret there. Scroll down to “retrieve secret value” and click it. It’ll show default values.
Click on edit in the top right.
Update the default value to the PAK you’d like your builders to use and hit save. Make sure to use the “Plaintext” type of secret, rather than json-based key/value type.
Now that our secret is created and populated, we need to attach a secret policy to permit multi-account access. We use the same spoke IAM role as above.
resource "aws_secretsmanager_secret_policy" "hub_ado_join_pak" { | |
secret_arn = aws_secretsmanager_secret.hub_ado_join_pak.arn | |
policy = jsonencode( | |
{ | |
"Version" : "2012-10-17", | |
"Statement" : [{ | |
"Sid" : "AzureDevOpsBuildersSecretsAccess", | |
"Effect" : "Allow", | |
"Action" : "secretsmanager:GetSecretValue", | |
"Resource" : "*", | |
"Principal" : { | |
"AWS" : [ | |
"arn:aws:iam::bbbbbbbbbb:role/SpokeABuilderExecutionRole" | |
] | |
} | |
}] | |
}) | |
} |
And that’s it for the Hub account — now we can work on Spoke accounts, which is where the cool stuff lives anyway. Let’s walk through the resources required there.
Spoke Account Resources
All the items above we’ll build in only one account, the Hub. All of the following resources we’ll build in every single “runner” or “Spoke” account. These spoke accounts will draw on the ECR, secret, and KMS CMK from the Hub account.
Cloudwatch Log Group
Most AWS services support extensive logging, but it’s often disabled by default. For ECS services and tasks to log, we need to create a CloudWatch log group to receive the logs. No policies here — most access is within the spoke account, which works by default.
resource "aws_cloudwatch_log_group" "AzureDevOpsBuilderLogGroup" { | |
name = "AzureDevOpsBuilderLogGroup" | |
tags = { | |
Terraform = "true" | |
Name = "AzureDevOpsBuilderLogGroup" | |
} | |
} |
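If you’d rather not keep builder logs forever, the log group also accepts an optional retention setting. A sketch of the same resource with retention added, where the 30-day window is an arbitrary assumption:

resource "aws_cloudwatch_log_group" "AzureDevOpsBuilderLogGroup" {
  name              = "AzureDevOpsBuilderLogGroup"
  retention_in_days = 30 # Assumption, pick whatever fits your retention requirements
  tags = {
    Terraform = "true"
    Name      = "AzureDevOpsBuilderLogGroup"
  }
}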
ECS Task Execution Role and Policies
An ECS task ties together all sorts of stuff — a container definition, IAM roles to use to access required resources, sizing, and logging. The first one we’ll build is the “Execution IAM Role” — this is used when the ECS task is launched and needs to have the ability to access all required resources.
This is also the IAM role that you’ll need to grant access to in the Hub account policies above, since it’s the resource that’ll be doing the requesting.
resource "aws_iam_role" "SpokeABuilderExecutionRole" { | |
name = "SpokeABuilderExecutionRole" | |
assume_role_policy = jsonencode({ | |
Version = "2012-10-17" | |
Statement = [ | |
{ | |
Action = "sts:AssumeRole" | |
Effect = "Allow" | |
Sid = "" | |
Principal = { | |
Service = "ecs-tasks.amazonaws.com" | |
} | |
}, | |
] | |
}) | |
tags = { | |
Name = "SpokeABuilderExecutionRole" | |
Terraform = "true" | |
} | |
} |
To permit most ECS task actions, like pulling the image, writing logs, etc., AWS provides a managed IAM policy that we can attach to our role. So let’s do that:
resource "aws_iam_role_policy_attachment" "SpokeABuilderExecutionRole_to_ecsTaskExecutionRole" { | |
role = aws_iam_role.SpokeABuilderExecutionRole.name | |
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy" | |
} |
We also want to permit this requesting IAM role to get to the Hub’s secret and the KMS CMK key it is encrypted with.
Note that we don’t need to grant access to the Hub’s ECR since the above AmazonECSTaskExecutionRolePolicy already grants it.
resource "aws_iam_role_policy" "SpokeABuilderExecutionRoleSsmRead" { | |
name = "SpokeABuilderExecutionRoleSsmRead" | |
role = aws_iam_role.SpokeABuilderExecutionRole.id | |
policy = jsonencode( | |
{ | |
"Version" : "2012-10-17", | |
"Statement" : [ | |
{ | |
"Effect" : "Allow", | |
"Action" : [ | |
"secretsmanager:GetSecretValue" | |
], | |
"Resource" : [ | |
"arn:aws:secretsmanager:us-east-1:aaaaaaaaaaa:secret:SecretName*" <-- Note the "*" at the end, this is required, the ARN in the Hub account | |
] | |
}, | |
{ | |
"Effect" : "Allow", | |
"Action" : [ | |
"kms:Decrypt" | |
], | |
"Resource" : [ | |
"arn:aws:kms:us-east-1:aaaaaaaa:key/1111111-22222-33333-444444444444" <-- The ARN of the key in the Hub account | |
] | |
} | |
] | |
} | |
) | |
} |
ECS Cluster
The ECS cluster doesn’t do much — it defines who will provide the compute for any tasks or services assigned to it, and we say we’d prefer FARGATE, where AWS provides the compute for us.
resource "aws_ecs_cluster" "fargate_cluster" { | |
name = "AzureDevOpsBuilderCluster" | |
capacity_providers = [ | |
"FARGATE" | |
] | |
default_capacity_provider_strategy { | |
capacity_provider = "FARGATE" | |
} | |
} |
ECS Task Definition
The ECS task definition has a LOT of jobs! I added notes in the config below to help guide you — we define an in-VPC task that pulls the Hub’s ECR Docker image (with an optional tag), grabs the secret from the Hub and injects it, and sets a few other non-secret environment variables.
There’s more here than I can cover quickly, so please read over the config below.
resource "aws_ecs_task_definition" "azure_devops_builder_task" { | |
family = "AzureDevOpsBuilder" | |
execution_role_arn = aws_iam_role.SpokeABuilderExecutionRole.arn | |
#task_role_arn = xxxxxx # Optional, ARN of IAM role assigned to container once booting, grants rights | |
network_mode = "awsvpc" | |
requires_compatibilities = ["FARGATE"] | |
# Fargate cpu/mem must match available options: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-cpu-memory-error.html | |
cpu = var.fargate_cpu # Variable, default to 1024 | |
memory = var.fargate_mem # Variable, defaults to 2048 | |
container_definitions = jsonencode( | |
[ | |
{ | |
name = "AzureDevOpsBuilder" | |
image = "${var.image_ecr_url}:${var.image_tag}" # The URL of the Hub's ECR and a tag, e.g. aaaaaaa.dkr.ecr.us-east-1.amazonaws.com/hub_ecr_repository:prod | |
cpu = "${var.container_cpu}" # Variable, default to 1024 | |
memory = "${var.container_mem}" # Variable, defaults to 2048 | |
essential = true | |
environment : [ | |
{ name : "AZP_URL", value : "https://dev.azure.com/foobar" }, | |
{ name : "AZP_POOL", value : "BuilderPoolName" } # Optional, builder will join "default" pool if not provided | |
] | |
secrets : [ | |
{ name : "AZP_TOKEN", valueFrom : "${var.ado_join_secret_token_arn}" } # ARN of Hub's secret to fetch and inject | |
] | |
logConfiguration : { | |
logDriver : "awslogs", | |
options : { | |
awslogs-group : "AzureDevOpsBuilderLogGroup", | |
awslogs-region : "${data.aws_region.current_region.name}", # Data source to gather region, e.g. data "aws_region" "current_region" {} | |
awslogs-stream-prefix : "AzureDevOpsBuilder" | |
} | |
} | |
} | |
] | |
) | |
tags = { | |
Name = "AzureDevOpsBuilder" | |
} | |
} |
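The task definition above leans on several input variables and a region data source. For completeness, here is a sketch of their declarations; the names come straight from the config above, but the defaults and descriptions are assumptions based on the inline comments:

# Region data source referenced by the awslogs options above
data "aws_region" "current_region" {}

variable "fargate_cpu" {
  description = "CPU units for the Fargate task"
  default     = 1024
}

variable "fargate_mem" {
  description = "Memory (in MiB) for the Fargate task"
  default     = 2048
}

variable "container_cpu" {
  description = "CPU units for the builder container"
  default     = 1024
}

variable "container_mem" {
  description = "Memory (in MiB) for the builder container"
  default     = 2048
}

variable "image_ecr_url" {
  description = "URL of the Hub account's ECR repository"
}

variable "image_tag" {
  description = "Tag of the builder image to pull, e.g. prod"
}

variable "ado_join_secret_token_arn" {
  description = "ARN of the Hub account's AZP_TOKEN secret"
}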
ECS Service
The ECS service is a host for a task, and defines the subnets and security groups to assign to any tasks spun up within it. We also make sure to ignore changes to the desired_count attribute so the scheduled scaling actions can update that value on a cron schedule without Terraform reverting it.
resource "aws_ecs_service" "azure_devops_builder_service" { | |
name = "AzureDevOpsBuilderService" | |
cluster = aws_ecs_cluster.fargate_cluster.id | |
task_definition = aws_ecs_task_definition.azure_devops_builder_task.arn | |
desired_count = var.autoscale_task_weekday_scale_down # Defaults to 1 instance of the task | |
launch_type = "FARGATE" | |
platform_version = "LATEST" | |
network_configuration { | |
subnets = var.service_subnets # List of subnets for where service should launch tasks in | |
security_groups = var.service_sg # List of security groups to provide to tasks launched within service | |
} | |
lifecycle { | |
ignore_changes = [desired_count] # Ignored desired count changes live, permitting schedulers to update this value without terraform reverting | |
} | |
} |
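Similarly, the service and the scaling resources below reference a handful of variables. A sketch of those declarations, with the types and defaults as assumptions (other than the scale-down default of 1, which matches the comment above):

variable "service_subnets" {
  description = "List of subnet IDs the service launches tasks into"
  type        = list(string)
}

variable "service_sg" {
  description = "List of security group IDs attached to launched tasks"
  type        = list(string)
}

variable "enable_scaling" {
  description = "Whether to create the autoscaling target and schedules"
  type        = bool
  default     = true
}

variable "autoscale_task_weekday_scale_up" {
  description = "Task count during business hours"
  default     = 4 # Assumption, pick what your workload needs
}

variable "autoscale_task_weekday_scale_down" {
  description = "Task count after hours, also the service's initial desired_count"
  default     = 1
}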
App AutoScale Target
We want to scale this service on a cron schedule, which means we’ll need to create an “aws_appautoscaling_target,” which connects scheduling mechanisms (like the scheduled actions below) to the ECS service. We set min and max capacity, but also ignore changes to them, since the values here will be over-ridden by the scheduled actions.
resource "aws_appautoscaling_target" "AzureDevOpsBuilderServiceAutoScalingTarget" { | |
count = var.enable_scaling ? 1 : 0 | |
min_capacity = var.autoscale_task_weekday_scale_down | |
max_capacity = var.autoscale_task_weekday_scale_up | |
resource_id = "service/${aws_ecs_cluster.fargate_cluster.name}/${aws_ecs_service.azure_devops_builder_service.name}" # service/(clusterName)/(serviceName) | |
scalable_dimension = "ecs:service:DesiredCount" | |
service_namespace = "ecs" | |
lifecycle { | |
ignore_changes = [ | |
min_capacity, | |
max_capacity, | |
] | |
} | |
} |
ECS AutoScaling Schedule
Now we’re able to define particular schedules using cron. There are two major things we’re doing with these schedules:
Scale down at end of day to save $$, and scale up at beginning of workday to provide adequate compute for jobs
Scale to 0 at midnight of each day to make sure any host is alive for a max of 24 hours before being replaced
Note the timezone modifier for cron — unlike most services, we don’t need to create definitions in UTC.
# Scale up weekdays at beginning of day
resource "aws_appautoscaling_scheduled_action" "ADOBuilderWeekdayScaleUp" {
  count              = var.enable_scaling ? 1 : 0
  name               = "ADOBuilderScaleUp"
  service_namespace  = aws_appautoscaling_target.AzureDevOpsBuilderServiceAutoScalingTarget[0].service_namespace
  resource_id        = aws_appautoscaling_target.AzureDevOpsBuilderServiceAutoScalingTarget[0].resource_id
  scalable_dimension = aws_appautoscaling_target.AzureDevOpsBuilderServiceAutoScalingTarget[0].scalable_dimension
  schedule           = "cron(0 6 ? * MON-FRI *)" # Every weekday at 6 a.m.
  timezone           = "America/Los_Angeles"
  scalable_target_action {
    min_capacity = var.autoscale_task_weekday_scale_up
    max_capacity = var.autoscale_task_weekday_scale_up
  }
}

# Scale down weekdays at end of day
resource "aws_appautoscaling_scheduled_action" "ADOBuilderWeekdayScaleDown" {
  count              = var.enable_scaling ? 1 : 0
  name               = "ADOBuilderScaleDown"
  service_namespace  = aws_appautoscaling_target.AzureDevOpsBuilderServiceAutoScalingTarget[0].service_namespace
  resource_id        = aws_appautoscaling_target.AzureDevOpsBuilderServiceAutoScalingTarget[0].resource_id
  scalable_dimension = aws_appautoscaling_target.AzureDevOpsBuilderServiceAutoScalingTarget[0].scalable_dimension
  schedule           = "cron(0 20 ? * MON-FRI *)" # Every weekday at 8 p.m.
  timezone           = "America/Los_Angeles"
  scalable_target_action {
    min_capacity = var.autoscale_task_weekday_scale_down
    max_capacity = var.autoscale_task_weekday_scale_down
  }
}

# Scale to 0 to refresh the fleet
resource "aws_appautoscaling_scheduled_action" "ADOBuilderRefresh" {
  count              = var.enable_scaling ? 1 : 0
  name               = "ADOBuilderRefresh"
  service_namespace  = aws_appautoscaling_target.AzureDevOpsBuilderServiceAutoScalingTarget[0].service_namespace
  resource_id        = aws_appautoscaling_target.AzureDevOpsBuilderServiceAutoScalingTarget[0].resource_id
  scalable_dimension = aws_appautoscaling_target.AzureDevOpsBuilderServiceAutoScalingTarget[0].scalable_dimension
  schedule           = "cron(0 0 ? * MON-FRI *)" # Every weekday at midnight
  timezone           = "America/Los_Angeles"
  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}
Profit!
Phew, that’s a lot of resources and configuration, isn’t it! Now that this is in place, you should have n Spoke accounts running their own builders, each automatically polling your Hub account for new images at least once a day. And if the container is set to die after one run (which we do, and recommend!), it’ll grab a fresh image after every job.
The full codebase is here:
KyMidd/Terraform_ADO_ECR_Multi-Account_Access (github.com): Assume the “aaaaaaa” account is the Hub where the secret and ECR live, and the “bbbbbb” account is a single Spoke account.
Please have fun building this yourself, and good luck out there!
kyler