Let’s Do DevOps: Resource-Level Automated Terraform CI/CD Approvals

Mar 28, 2021

This blog series focuses on presenting complex DevOps projects as simple and approachable via plain language and lots of pictures. You can do it!

Hey all!

I wrote a blog entry recently about a desire in my company to automate review and approval of terraform changes. I started out with really simple logic:

If only adds or changes → Automatically approve
If any destroys or rebuilds → Require manual approval

For more details on how I built that, see here.

However, every single time I presented this cool new thing to folks, I got the same questions back.

What if a resource type should never be modified?

What if a resource type gets rebuilt all the time, and that’s normal and not a cause for concern?

Those are great questions. The natural next step from a broad, applies-to-everything-equally rule set is exception logic, so that some resource types can be treated differently from others. It took me some time, but I built this logic in bash in such a way that my internal teams can easily add the resource types they want the special logic to apply to. I also published all the code so you can do it too; the link is all the way at the bottom.

But first, let’s talk about what we did and how.

Simpler Logic and Times

When this project first started, my business wanted to have manual approval on changes. We achieved that in Azure DevOps using the concept of environments and separate stages for terraform plan vs terraform apply.

That worked great, but we wanted to automate it more. Can we have the CI/CD behave differently for safer operations, like resource adds or modifies, vs more dangerous operations like destroys or rebuilds?

We can! I built out a simple bash script that uses some grep and if/elif logic to read the text that terraform show produces from the binary file written by terraform plan -out plan.out. That worked great, but there are two problems:

Updates at scale: This bash is embedded in the pipeline, and if you’re managing a lot of environments like I am, it means each time you build onto this file, you need to update it in 100+ places. Not ideal.
Resource Type Targeting: Clearly, a human being would treat rebuilding an RDS database differently from rebuilding a security group rule. One contains potentially irreplaceable data, and the other contains no data.

Our automation should reflect as well as possible what a real human being would do in this scenario.

The goal here is that this testing and approval gets so good that we can fully automate all terraform IaC pipelines and do many deployments to production each day without human intervention. That’s an incredibly lofty goal, but if we take many small steps to remove friction while adding automated protections, we might get there!

Step 1: Call a Central File

This was the easy part. Rather than having all the code embedded right in a YML pipeline task, like this:

	- task: Bash@3
	  name: AutoApprovalTest
	  displayName: Auto-Approval Test
	  inputs:
	    targetType: 'inline'
	    workingDirectory: $(System.DefaultWorkingDirectory)/$(tf_directory)
	    failOnStderr: true
	    script: |
	      # If no changes, no-op and don't continue
	      if terraform show plan.out | grep -q " 0 to add, 0 to change, 0 to destroy"; then
	        echo "##[section]No changes, terraform apply will not run";

	      # Check if resources destroyed. If no, don't require approval
	      elif terraform show plan.out | grep -q "to change, 0 to destroy"; then
	        echo "##[section]Approval not required";
	        echo "##[section]Automatic terraform apply triggered";
	        echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]false"

	      # Check if resources destroyed. If yes, require approvals
	      else
	        echo "##[section]Terraform apply requires manual approval";
	        echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]true"
	      fi

I instead updated the task to something like this. It calls the code from a central location that’s still in the repo. That means we can still track the code, but the powerful thing here is that any number of pipelines can refer to the same code, and when we update it, it updates everywhere.

	- task: Bash@3
	  name: AutoApprovalTest
	  displayName: Auto-Approval Test
	  inputs:
	    filePath: '$(System.DefaultWorkingDirectory)/pipelines/auto_approval_testing/tf_safe_test.sh'
	    arguments: ''
	    workingDirectory: '$(System.DefaultWorkingDirectory)/$(tf_directory)'
	    failOnStderr: false

Scale problem = tackled.

Resource Targeted Exceptions

Honestly, adding this exception logic was harder than I imagined. The final product makes my little terraform show plan.out | grep -q "test string" logic look pretty silly.

The reason for that is we need branching logic. As a human, you do this without really thinking about it. If you’re looking at a change line and it says, foobar will be deleted vs foobar will be modified, you use human heuristic intelligence to figure out what the code is doing, and can reason out why it’s doing it and the impact it’ll have.

One Line for Each Resource Modification

Computers are very very fast, but alas, very dumb. That means we need to copy how you’d reason out the task. First, the free stuff — we need to read the plan.out file we generated. However, rather than reading the whole file like we did in the v1 code above, we need to filter it down to just the lines where a resource is having an action done to it.

This is part of scaling out protection (should we really read and compute over a very long change plan hundreds of times?), but also a way to start breaking the problem into a looping problem. If each line is equal to one change to one resource, we can loop over it and start making our computer understand what a human would do.

	#!/bin/bash

	# Plan.out is binary file populated with "terraform plan -out plan.out"
	# Use terraform show to read plan.out as text, and filter for resource change lines, output to file
	terraform show -no-color plan.out | grep "will be" > plan_decoded.out
	terraform show -no-color plan.out | grep "must be" >> plan_decoded.out
	input="plan_decoded.out"

Now we have a plan_decoded.out file that looks like this:

  # module.networking.aws_security_group_rule.rule2 will be created
  # module.networking.aws_security_group_rule.rule3 will be updated in-place
  # module.networking.aws_security_group_rule.rule1 must be replaced
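To see the filter in isolation without a real plan file, here is a minimal sketch that runs a similar grep over a hard-coded sample. The sample text and the filter_change_lines name are made up for illustration; in the real script the input comes from terraform show -no-color plan.out.

```shell
#!/bin/bash

# Hypothetical sample of "terraform show -no-color" output (illustration only).
sample_plan='Terraform will perform the following actions:

  # module.networking.aws_security_group_rule.rule1 must be replaced
-/+ resource "aws_security_group_rule" "rule1" {
    }

  # module.networking.aws_security_group_rule.rule2 will be created
  + resource "aws_security_group_rule" "rule2" {
    }

Plan: 2 to add, 0 to change, 1 to destroy.'

# Keep only the per-resource header lines. A single grep -E covers both
# phrasings and preserves the order in which the changes were listed.
filter_change_lines() {
  grep -E "^ *# .* (will|must) be "
}

change_lines=$(printf '%s\n' "$sample_plan" | filter_change_lines)
printf '%s\n' "$change_lines"
```

Anchoring the pattern on the leading # is a design choice in this sketch: it makes it less likely that an attribute value containing the words "will be" ends up in the filtered file.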

Which Resource Types are Special?

Now, we need to define arrays of resource types with special exception logic. These are resources that are:

Always Safe to modify/delete/recreate without human approval
Never Safe to modify/delete/recreate without human approval

	declare -a ResourceTypesAlwaysUnsafe=(
	  "aws_instance"
	  "foobar"
	)
	declare -a ResourceTypesAlwaysSafe=(
	  "aws_security_group_rule"
	  "foobar"
	)

These lists can scale out indefinitely to cover hundreds of resource types.

No Changes, Exit

If there are no changes at all, then there’s no reason for any of this magic — we should just exit out.

	if terraform show plan.out | grep -q " 0 to add, 0 to change, 0 to destroy"; then
	  echo "##[section]No changes detected, terraform apply will not run";
	  # There are no changes
	  exit 0
	fi
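As a sketch, the same test can be wrapped in a function that reads plan text on stdin, which makes it easy to try against sample strings. The function name plan_has_no_changes is hypothetical. The second pattern is an assumption: some terraform versions print a sentence starting with "No changes." rather than a zeroed summary line, so check what your version actually emits before relying on it.

```shell
#!/bin/bash

# Returns 0 (true) when the plan text on stdin reports nothing to do.
plan_has_no_changes() {
  # First pattern: the summary line the script above greps for.
  # Second pattern: ASSUMPTION, wording printed by some terraform versions.
  grep -q -e " 0 to add, 0 to change, 0 to destroy" -e "^No changes\."
}

if echo "Plan: 0 to add, 0 to change, 0 to destroy." | plan_has_no_changes; then
  echo "##[section]No changes detected, terraform apply will not run"
fi
```

In the real script the input would be piped in from terraform show -no-color plan.out instead of echo.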

If this test is false, we continue on and start reading the resource changes line by line. This while loop is massive, so let’s break it down.

Loop Over Every Resource Change Line

We do a couple of cool things here.

On line 4, we set a variable, approvalRequired, to value notSure. We’ll run a series of tests using exception logic (if the resource_type is in one of our exception arrays above) and then normal logic (if add/change, automatic approval, if destroy/recreate, require manual approval) and set this variable.

We first want to figure out the resource full path, and we can do that with cut on line 7. That gets us to something like:

module.networking.aws_security_group_rule.Inbound_192Slash16_PermitAll

But that’s not a resource type, that’s the full terraform logical path. The simplest way to figure out the resource type is to have cut give us the second-to-last item in the string, separated by . characters. Because terraform paths are naturally of a variable length, counting from the end is a chore in bash, but there’s a clever way around it. First, we reverse the string to:

llAtimreP_61hsalS291_dnuobnI.elur_puorg_ytiruces_swa.gnikrowten.eludom

Now, that’s pretty hard to read for a human, but for a computer it makes perfect sense. And now we can tell cut exactly which number to grab, the second item, if we separate by the . character. Which gets us to:

elur_puorg_ytiruces_swa

Which again, for a human, is kinda hard to read. So we reverse it again, and voilà!:

aws_security_group_rule

	while IFS= read -r line; do

	  # Set approvalRequired
	  approvalRequired="notSure"

	  # Prepare resource path, e.g.: module.networking.aws_security_group_rule.Inbound_192Slash16_PermitAll ($line is left unquoted on purpose so word splitting drops the leading spaces)
	  resource_path=$(echo $line | cut -d " " -f 2)
	  # Prepare resource type, e.g.: aws_security_group_rule
	  resource_type=$(echo $resource_path | rev | cut -d "." -f 2 | rev)
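Here is the reverse, cut, reverse trick as a small standalone sketch you can run, next to an alternative that uses only bash parameter expansion. Both function names are hypothetical. Both versions assume no extra dots appear after the resource type, so an index key that itself contains a dot would throw them off.

```shell
#!/bin/bash

# Second-to-last dot-separated field, using the reverse/cut/reverse trick.
resource_type_rev() {
  echo "$1" | rev | cut -d "." -f 2 | rev
}

# Same result with bash parameter expansion only (no subprocesses).
resource_type_expand() {
  local without_name="${1%.*}"   # strip the trailing ".<resource name>"
  echo "${without_name##*.}"     # keep what follows the last remaining dot
}

path="module.networking.aws_security_group_rule.Inbound_192Slash16_PermitAll"
resource_type_rev "$path"      # aws_security_group_rule
resource_type_expand "$path"   # aws_security_group_rule
```

The parameter expansion version avoids spawning three processes per line, which may matter on very long plans; the rev version is arguably easier to explain.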

Now we know our resource type, and we need to start applying our logic.

For Action, Test

We could potentially wrap these tests into a single if/then, but that gets complicated, and I’d rather it loop a few more times on the resource list and be easy to read than be a bit faster and harder to read.

Thus, we look at our $line, which is the entire change line, like this:

  # module.networking.aws_security_group_rule.Inbound_192Slash16_PermitAll will be destroyed

And we do a partial match for “will be destroyed” against the line. Searching for the whole string, will be destroyed instead of just destroyed should help avoid a false-positive match if a resource’s name includes the word destroyed for some reason. If it matches, we start testing against our arrays.

First, we look at each item in our ResourceTypesAlwaysUnsafe array for this resource type that we figured out earlier. If it’s there, we print an informational output line and set the approvalRequired variable to yes.

If not, we move on to the ResourceTypesAlwaysSafe array, and check that. Same process, but we mark approvalRequired to no.

And if that still doesn’t match, we follow our normal logic, where destroy == require approval, and set approvalRequired to yes.

	# Partial match: the * wildcards let the phrase appear anywhere in the line
	if [[ $line == *"will be destroyed"* ]]; then

	  # If destroyed resource is always unsafe, trigger approval
	  if [[ ${ResourceTypesAlwaysUnsafe[@]} =~ ${resource_type} ]]; then
	    # Mark this path unsafe, require approval
	    echo "This resource is planned to be deleted, and is always unsafe to destroy without approval:" "$resource_path"
	    approvalRequired="yes"

	  # If destroyed resource is always safe, then don't trigger approval
	  elif [[ ${ResourceTypesAlwaysSafe[@]} =~ ${resource_type} ]]; then
	    echo "This resource is planned to be deleted, but is marked safe to destroy without approval:" "$resource_path"
	    approvalRequired="no"

	  # If destroyed resource isn't handled already, then fall back to the default rule: require approval
	  else
	    echo "Approval required on" "$resource_path"
	    approvalRequired="yes"
	  fi
	fi
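One thing to watch: =~ against the joined array is a substring match, so a type like aws_security_group would also match an entry of aws_security_group_rule. If exact matching matters to you, a small helper can do it. This is a sketch, not the code from the repo; in_list and decide_destroy are hypothetical names, and the arrays are trimmed to one entry each.

```shell
#!/bin/bash

declare -a ResourceTypesAlwaysUnsafe=("aws_instance")
declare -a ResourceTypesAlwaysSafe=("aws_security_group_rule")

# in_list NEEDLE ITEM... : succeeds only on an exact, whole-item match.
in_list() {
  local needle="$1" item
  shift
  for item in "$@"; do
    [[ "$item" == "$needle" ]] && return 0
  done
  return 1
}

# Prints yes or no for a destroyed resource of the given type.
decide_destroy() {
  local resource_type="$1"
  if in_list "$resource_type" "${ResourceTypesAlwaysUnsafe[@]}"; then
    echo "yes"   # always unsafe: require approval
  elif in_list "$resource_type" "${ResourceTypesAlwaysSafe[@]}"; then
    echo "no"    # always safe: skip approval
  else
    echo "yes"   # default rule: destroys require approval
  fi
}

decide_destroy "aws_security_group_rule"   # no
decide_destroy "aws_security_group"        # yes (not a substring hit)
decide_destroy "aws_db_instance"           # yes
```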

New Resources Are Safe

We then test for the other types of actions, like “must be replaced” and “will be updated”, print whatever informational lines make sense, and set our approvalRequired variable to an appropriate value.

I assume here that creating a new resource (not modifying or destroying an existing one) is always safe regardless, so we don’t further test. If create, safe.

	# Partial match: the * wildcards let the phrase appear anywhere in the line
	if [[ $line == *"will be created"* ]]; then
	  echo "##[section]Approval not required for" "$resource_path"
	  approvalRequired="no"
	fi
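Only the destroy and create branches are shown here. As a rough sketch of how the action itself could be pulled out of a line before any array checks, a case statement keeps all the phrases in one place. The action_of name and the words it prints are hypothetical; the phrases are the ones mentioned above.

```shell
#!/bin/bash

# Map one change line to a short action word.
action_of() {
  case "$1" in
    *"will be created"*)   echo "create"  ;;
    *"will be updated"*)   echo "update"  ;;
    *"must be replaced"*)  echo "replace" ;;
    *"will be destroyed"*) echo "destroy" ;;
    *)                     echo "unknown" ;;
  esac
}

action_of "  # module.networking.aws_security_group_rule.rule1 must be replaced"   # replace
action_of "  # module.networking.aws_security_group_rule.rule2 will be created"    # create
action_of "  # data.aws_ami.latest will be read during apply"                      # unknown
```

An unknown result lines up with the notSure value set at the top of the loop: the line matched "will be" but none of the actions we know how to judge.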

If approvalRequired, Exit

At the bottom of each pass through the while loop, we do a few things. Primarily, we check the variable approvalRequired that should have been set by the logic above.

If it’s yes, then there’s no reason to go on with the checking. We immediately bail out and trigger Azure DevOps to prompt the environment owner for approval.

Importantly, this immediately breaks our testing loop, even if every other resource is safe to change.

Due to terraform’s “batching” nature where changes are piled up in source code until apply is triggered, we can’t deploy changes individually. For this batch of changes, if any of them require approval, we have to set it.

On line 13, if approval isn’t required, we do nothing and continue — just because this resource loop was happy doesn’t mean all resources will have a positive response.

On line 19, if approvalRequired is still set to notSure, then something’s gone wrong, and we bail out. The same behavior applies on line 25, if approvalRequired isn’t any of the above. Bail out, print error messages, don’t move forward.

	  # If approval required, exit immediately and export values
	  if [[ $approvalRequired == "yes" ]]; then
	    echo "****************************************"
	    echo "##[section]Approval will be required"
	    echo "****************************************"
	    echo ""
	    echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]true"
	    echo ""
	    echo ""
	    break

	  # If approval not required, continue
	  elif [[ $approvalRequired == "no" ]]; then
	    # Can't declare all good here until all lines evaluated, so removed from while loop
	    # After loop, will gather info and make positive approval choice
	    continue

	  # If we haven't made a choice here yet, something has gone wrong, exit
	  elif [[ $approvalRequired == "notSure" ]]; then
	    echo "##[error]Something has gone wrong, can't determine"
	    echo "##[error]Exiting, approval will be required to apply"
	    exit 1

	  # Shouldn't reach here
	  else
	    echo "##[error]Something has gone wrong, can't determine"
	    echo "##[error]Exiting, approval will be required to apply"
	    exit 1
	  fi

Then we end the loop, and cycle until all resources are tested.

All Resources Now Checked

At the end of the script, outside of the while loop, we check to see if the approvalRequired variable is set to no. If it is, that means that our loop read every single resource, did its logic, and didn’t break out of the loop due to a resource requiring approval.

This is a happy result for us — we can safely determine that all resources have been cleared for automatic deployment, and trigger it.

	# If all lines evaluated, and we still haven't decided to require approval, then all
	# resources have been checked and none triggered approval flow
	if [[ $approvalRequired == "no" ]]; then
	  echo "****************************************"
	  echo "##[section]Approval will not be required"
	  echo "****************************************"
	  echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]false"
	  echo ""
	  echo ""
	fi
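Putting the pieces together, here is a condensed sketch of the loop’s control flow: stop at the first line that needs approval, otherwise fall through to the final check. To keep it short it applies only the default rule (destroy or replace needs approval, create or update does not) and leaves out the exception arrays. evaluate_lines is a hypothetical name, and it prints true or false instead of emitting the Azure DevOps logging command.

```shell
#!/bin/bash

# Reads change lines on stdin. Prints "true" if approval is required,
# "false" if every line was cleared; returns 1 if a line is unrecognised.
evaluate_lines() {
  local line approvalRequired="no"
  while IFS= read -r line; do
    approvalRequired="notSure"
    case "$line" in
      *"will be destroyed"*|*"must be replaced"*) approvalRequired="yes" ;;
      *"will be created"*|*"will be updated"*)    approvalRequired="no"  ;;
    esac

    if [[ $approvalRequired == "yes" ]]; then
      break                      # one unsafe change gates the whole batch
    elif [[ $approvalRequired == "notSure" ]]; then
      echo "can't classify: $line" >&2
      return 1
    fi
  done

  if [[ $approvalRequired == "yes" ]]; then echo "true"; else echo "false"; fi
}

printf '%s\n' \
  "  # aws_security_group_rule.a will be created" \
  "  # aws_instance.b must be replaced" \
  "  # aws_security_group_rule.c will be created" | evaluate_lines   # true
```

In the real script the true or false result is what gets written with ##vso[task.setvariable variable=approvalRequired;isOutput=true], and the lines come from plan_decoded.out rather than printf.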

Summary

This model gives us an increasing amount of control over which resource types and actions can be automatically approved by our CI/CD deployment logic. I’m certain this will be an ongoing effort as teams begin to understand what this can do for them, and the amount of time it can save.

Here’s the source code so you can go build it yourself!

KyMidd/AzureDevOps_Terraform_ResourceType_AutoApprovals
Contribute to KyMidd/AzureDevOps_Terraform_ResourceType_AutoApprovals development by creating an account on GitHub.
github.com

Thanks everyone, and good luck out there.
kyler
