Let’s Do DevOps: Resource-Level Automated Terraform CI/CD Approvals
This blog series focuses on presenting complex DevOps projects as simple and approachable via plain language and lots of pictures. You can do it!
Hey all!
I wrote a blog entry recently about a desire in my company to automate review and approval of terraform changes. I started out with really simple logic:
If only adds or changes → Automatically approve
If any destroys or rebuilds → Require manual approval
For more details on how I built that, see here.
However, every single time I presented this cool new thing to folks, I got the same questions back.
What if a resource type should never be modified?
or
What if a resource type gets rebuilt all the time, and that’s normal and not a cause for concern?
Those are great questions. Of course the next step from the broad applies-to-everything-equally rule-set would be exception logic, so some resource types could be treated differently than others. It took me some time, but I built out this logic in bash, and it’s built so that my internal teams can easily add the resource types they want the special logic to apply to. I published all the code, so you can do it too; the link is all the way at the bottom.
But first, let’s talk about what we did and how.
Simpler Logic and Times
When this project first started, my business wanted to have manual approval on changes. We achieved that in Azure DevOps using the concept of environments and separate stages for terraform plan vs terraform apply.
That worked great, but we wanted to automate it more. Can we have the CI/CD behave differently for safer operations, like resource adds or modifies, vs more dangerous operations like destroys or rebuilds?
We can! I built out a simple bash script that uses some grep and if/elif logic to read the text that terraform show builds from a terraform plan -out plan.out binary file. That worked great, but there are two problems:
Updates at scale: This bash is embedded in the pipeline, and if you’re managing a lot of environments like I am, it means each time you build onto this script, you need to update it in 100+ places. Not ideal.
Resource Type Targeting: Clearly, a human being would treat rebuilding an RDS database differently from rebuilding a security group rule. One contains potentially irreplaceable data, and the other contains no data.
Our automation should reflect as well as possible what a real human being would do in this scenario.
The goal here is that this testing and approval gets so good that we can fully automate all terraform IaC pipelines and do many deployments to production each day without human intervention. That’s an incredibly lofty goal, but if we take many small steps to remove friction while adding automated protections, we might get there!
Step 1: Call a Central File
This was the easy part. Rather than having all the code embedded right in a YML pipeline task, like this:
- task: Bash@3
  name: AutoApprovalTest
  displayName: Auto-Approval Test
  inputs:
    targetType: 'inline'
    workingDirectory: $(System.DefaultWorkingDirectory)/$(tf_directory)
    failOnStderr: true
    script: |
      # If no changes, no-op and don't continue
      if terraform show plan.out | grep -q " 0 to add, 0 to change, 0 to destroy"; then
        echo "##[section]No changes, terraform apply will not run";
      # Check if resources destroyed. If no, don't require approval
      elif terraform show plan.out | grep -q "to change, 0 to destroy"; then
        echo "##[section]Approval not required";
        echo "##[section]Automatic terraform apply triggered";
        echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]false"
      # Check if resources destroyed. If yes, require approvals
      else
        echo "##[section]Terraform apply requires manual approval";
        echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]true"
      fi
I instead updated the task to call the code from a central location that’s still in the repo. That means we can still track the code, but the powerful thing here is that n pipelines can refer to the same code, and when we update it, it updates everywhere.
Scale problem = tackled.
Resource Targeted Exceptions
Honestly, adding this exception logic was harder than I imagined. The final product makes my little cat plan.out | grep -q "test string" logic look pretty silly.
The reason for that is we need branching logic. As a human, you do this without really thinking about it. If you’re looking at a change line and it says foobar will be deleted vs foobar will be modified, you use human heuristic intelligence to figure out what the code is doing, and can reason out why it’s doing it and the impact it’ll have.
One Line for Each Resource Modification
Computers are very very fast, but alas, very dumb. That means we need to copy how you’d reason out the task. First, the free stuff: we need to read the plan.out file we generated. However, rather than reading the whole file like we did in the v1 code above, we need to filter it down to just the lines where a resource is having an action done to it.
This is part of scaling out protection (should we really read and compute over a very long change plan hundreds of times?), but also a way to start breaking the problem into a looping problem. If each line is equal to one change to one resource, we can loop over it and start making our computer understand what a human would do.
#!/bin/bash
# Plan.out is binary file populated with "terraform plan -out plan.out"
# Use terraform show to read plan.out as text, and filter for resource change lines, output to file
terraform show -no-color plan.out | grep "will be" > plan_decoded.out
terraform show -no-color plan.out | grep "must be" >> plan_decoded.out
input="plan_decoded.out"
Now we have a plan_decoded.out that looks like this:
# module.networking.aws_security_group_rule.rule2 will be created
# module.networking.aws_security_group_rule.rule3 will be updated in-place
# module.networking.aws_security_group_rule.rule1 must be replaced
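As a rough sketch of this filtering step, the snippet below runs the same idea against a hard-coded sample instead of a real plan.out, so it can be tried without terraform installed. The function name filter_change_lines and the sample text are assumptions made up for illustration. It also uses a single grep -E pass rather than the two greps above, which is a swapped-in variation that keeps the lines in plan order.

```bash
#!/bin/bash
# Hypothetical helper: keep only the "# <resource> will be/must be ..." lines.
# Reads plan text on stdin; in the real pipeline the input would come from
# `terraform show -no-color plan.out`.
filter_change_lines() {
  # "|| true" keeps the function from failing when nothing matches
  grep -E '^[[:space:]]*# .* (will be|must be) ' || true
}

# Made-up sample of plan text, for illustration only
sample_plan='Terraform will perform the following actions:

  # module.networking.aws_security_group_rule.rule1 must be replaced
-/+ resource "aws_security_group_rule" "rule1" {
    }

  # module.networking.aws_security_group_rule.rule2 will be created
  + resource "aws_security_group_rule" "rule2" {
    }

Plan: 2 to add, 0 to change, 1 to destroy.'

# Print only the two resource change lines
printf '%s\n' "$sample_plan" | filter_change_lines
```

Anchoring the pattern on the leading # is a design choice in this sketch: it avoids picking up attribute lines that happen to contain the words "will be".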
Which Resource Types are Special?
Now, we need to define arrays of resource types with special exception logic. These are resources that are:
Always Safe to modify/delete/recreate without human approval
Never Safe to modify/delete/recreate without human approval
declare -a ResourceTypesAlwaysUnsafe=(
  "aws_instance"
  "foobar"
)
declare -a ResourceTypesAlwaysSafe=(
  "aws_security_group_rule"
  "foobar"
)
These lists can scale out indefinitely, to cover hundreds of resource types.
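One thing to watch when checking a resource type against these arrays is partial matches: aws_security_group should not count as listed just because aws_security_group_rule is. Here is a minimal sketch of an exact-match lookup, assuming a hypothetical helper named in_list that is not part of the original script:

```bash
#!/bin/bash
declare -a ResourceTypesAlwaysSafe=(
  "aws_security_group_rule"
  "foobar"
)

# Hypothetical helper: succeed only if $1 exactly equals one of the remaining args
in_list() {
  local needle="$1"
  shift
  local item
  for item in "$@"; do
    # Quoted right-hand side means a literal string comparison, not a pattern
    if [[ "$item" == "$needle" ]]; then
      return 0
    fi
  done
  return 1
}

if in_list "aws_security_group_rule" "${ResourceTypesAlwaysSafe[@]}"; then
  echo "aws_security_group_rule is listed as always safe"
fi
if ! in_list "aws_security_group" "${ResourceTypesAlwaysSafe[@]}"; then
  echo "aws_security_group is not listed"
fi
```

A loop is a little slower than a single string test, but with lists of a few hundred entries that cost should be negligible, and it stays easy to read.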
No Changes, Exit
If there are no changes at all, then there’s no reason for any of this magic — we should just exit out.
if terraform show plan.out | grep -q " 0 to add, 0 to change, 0 to destroy"; then
  echo "##[section]No changes detected, terraform apply will not run";
  # There are no changes
  exit 0
fi
If this test is false, we continue on and start reading the resource changes line by line. This while loop is massive, so let’s break it down.
Loop Over Every Resource Change Line
We do a couple of cool things here.
At the top of the loop, we set a variable, approvalRequired, to the value notSure. We’ll run a series of tests using exception logic (if the resource_type is in one of our exception arrays above) and then normal logic (if add/change, automatic approval, if destroy/recreate, require manual approval) and set this variable.
We first want to figure out the resource full path, and we can do that with cut. That gets us to something like:
module.networking.aws_security_group_rule.Inbound_192Slash16_PermitAll
But that’s not a resource type, that’s the full terraform logical path. The simplest way to figure out the resource type is to have cut give us the 2nd to last item in the string, separated by . characters (since terraform paths are naturally of a variable length). However, this becomes a chore in bash, but there’s a clever way around it. First, we reverse the string to:
llAtimreP_61hsalS291_dnuobnI.elur_puorg_ytiruces_swa.gnikrowten.eludom
Now, that’s pretty hard to read for a human, but for a computer it makes perfect sense. And now we can tell cut exactly which number to grab, the second item, if we separate by the . character. Which gets us to:
elur_puorg_ytiruces_swa
Which again, for a human, is kinda hard to read. So we reverse it again, and voilà!:
aws_security_group_rule
while IFS= read -r line; do
  # Set approvalRequired
  approvalRequired="notSure"
  # Prepare resource path, e.g.: module.networking.aws_security_group_rule.Inbound_192Slash16_PermitAll
  resource_path=$(echo $line | cut -d " " -f 2)
  # Prepare resource type, e.g.: aws_security_group_rule
  resource_type=$(echo $resource_path | rev | cut -d "." -f 2 | rev)
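To make the reverse, cut, reverse trick easy to try on its own, here is a small sketch wrapped in a function. The name resource_type_of is made up for this example, and the awk variant is an alternative I’m assuming some readers will find more direct; it is not from the original script. Both assume the resource name itself contains no . characters (a for_each key with a dot in it would throw them off).

```bash
#!/bin/bash
# Hypothetical helper: print the second-to-last dot-separated field of a path
resource_type_of() {
  echo "$1" | rev | cut -d "." -f 2 | rev
}

# Alternative sketch: awk can count fields from the end directly via NF
resource_type_of_awk() {
  echo "$1" | awk -F '.' '{ print $(NF-1) }'
}

# Try both on paths of different lengths
resource_type_of "module.networking.aws_security_group_rule.Inbound_192Slash16_PermitAll"
resource_type_of_awk "aws_instance.web"
```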
Now we know our resource type, and we need to start applying our logic.
For Action, Test
We could potentially wrap these tests into a single if/then, but that gets complicated, and I’d rather it loop a few more times on the resource list and be easy to read than be a bit faster and harder to read.
Thus, we look at our $line, which is a single resource change line from the plan, like this:
# module.networking.aws_security_group_rule.Inbound_192Slash16_PermitAll will be destroyed
And we do a partial match for “will be destroyed” against the line. Searching for the whole string, will be destroyed instead of just destroyed, should help avoid a false-positive match if a resource’s name includes the word destroyed for some reason. If it matches, we start testing against our arrays.
First, we look at each item in our ResourceTypesAlwaysUnsafe array for this resource type that we figured out earlier. If it’s there, we print an informational output line and set the approvalRequired variable to yes.
If not, we move on to the ResourceTypesAlwaysSafe array, and check that. Same process, but we mark approvalRequired to no.
And if that still doesn’t match, we follow our normal logic, where destroy == require approval, and set approvalRequired to yes.
  if [[ $line == *"will be destroyed"* ]]; then
    # If destroyed resource is always unsafe, trigger approval
    # (space-padded glob match so only a whole array entry matches, not a substring of one)
    if [[ " ${ResourceTypesAlwaysUnsafe[*]} " == *" ${resource_type} "* ]]; then
      # Mark this path unsafe, require approval
      echo "This resource is planned to be deleted, and is always unsafe to destroy without approval:" $resource_path
      approvalRequired="yes"
    # If destroyed resource is always safe, then don't trigger approval
    elif [[ " ${ResourceTypesAlwaysSafe[*]} " == *" ${resource_type} "* ]]; then
      echo "This resource is planned to be deleted, but is marked safe to destroy without approval:" $resource_path
      approvalRequired="no"
    # If destroyed resource isn't handled already, then fall back to normal logic: destroy requires approval
    else
      echo "Approval required on" $resource_path
      approvalRequired="yes"
    fi
  fi
New Resources Are Safe
We then test for the other types of actions, like “must be replaced” and “will be updated in-place,” and can print whatever informational lines make sense and set our approvalRequired variable to an appropriate value.
I assume here that creating a new resource (not modifying or destroying an existing one) is always safe regardless, so we don’t further test. If create, safe.
  if [[ $line == *"will be created"* ]]; then
    echo "##[section]Approval not required for" $resource_path
    approvalRequired="no"
  fi
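The branches for the other actions follow the same shape as the destroy test. As a hedged sketch of how they might fit together, the function below returns a decision for one change line. The name decide_for_line, the one-entry arrays, and the exact ordering of checks (create is always safe, then the exception lists, then the per-action defaults) are my assumptions based on the rules described in this post, not a copy of the published script.

```bash
#!/bin/bash
declare -a ResourceTypesAlwaysUnsafe=("aws_instance")
declare -a ResourceTypesAlwaysSafe=("aws_security_group_rule")

# Hypothetical helper: print "yes", "no", or "notSure" for one change line
decide_for_line() {
  local line="$1"
  local resource_path resource_type
  # Unquoted $line on purpose: word splitting drops the leading indentation,
  # so the resource path is always the second space-separated field
  resource_path=$(echo $line | cut -d " " -f 2)
  resource_type=$(echo "$resource_path" | rev | cut -d "." -f 2 | rev)

  if [[ $line == *"will be created"* ]]; then
    echo "no"    # assumption from the post: creating is always safe
  elif [[ " ${ResourceTypesAlwaysUnsafe[*]} " == *" ${resource_type} "* ]]; then
    echo "yes"   # exception list: never touch without approval
  elif [[ " ${ResourceTypesAlwaysSafe[*]} " == *" ${resource_type} "* ]]; then
    echo "no"    # exception list: always fine to change
  elif [[ $line == *"must be replaced"* || $line == *"will be destroyed"* ]]; then
    echo "yes"   # default: rebuilds and destroys need a human
  elif [[ $line == *"will be updated in-place"* ]]; then
    echo "no"    # default: in-place changes are auto-approved
  else
    echo "notSure"
  fi
}

decide_for_line "  # module.networking.aws_security_group_rule.rule1 must be replaced"
decide_for_line "  # aws_instance.web will be updated in-place"
```

Anything the function does not recognize comes back as notSure, which matches the fail-closed behavior the loop relies on.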
If approvalRequired, Exit
At the bottom of each while loop, we do a few things. Primarily, we check the variable approvalRequired that should have been set by the logic above.
If it’s yes, then there’s no reason to go on with the checking. We immediately bail out and trigger Azure DevOps to prompt the environment owner for approval.
Importantly, this immediately breaks our testing loop, even if every other resource is safe to change.
Due to terraform’s “batching” nature where changes are piled up in source code until apply is triggered, we can’t deploy changes individually. For this batch of changes, if any of them require approval, we have to set it.
If approval isn’t required, we do nothing and continue: just because this resource was happy doesn’t mean all resources will have a positive response.
If approvalRequired is still set to notSure, then something’s gone wrong, and we bail out. The final else branch behaves the same way if approvalRequired isn’t any of the above. Bail out, print error messages, don’t move forward.
  # If approval required, exit immediately and export values
  if [[ $approvalRequired == "yes" ]]; then
    echo "****************************************"
    echo "##[section]Approval will be required"
    echo "****************************************"
    echo ""
    echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]true"
    echo ""
    echo ""
    break
  # If approval not required, continue
  elif [[ $approvalRequired == "no" ]]; then
    # Can't declare all good here until all lines evaluated, so removed from while loop
    # After loop, will gather info and make positive approval choice
    continue
  # If we haven't made a choice here yet, something has gone wrong, exit
  elif [[ $approvalRequired == "notSure" ]]; then
    echo "##[error]Something has gone wrong, can't determine"
    echo "##[error]Exiting, approval will be required to apply"
    exit 1
  # Shouldn't reach here
  else
    echo "##[error]Something has gone wrong, can't determine"
    echo "##[error]Exiting, approval will be required to apply"
    exit 1
  fi
Then we end the loop, and cycle until all resources are tested.
All Resources Now Checked
At the end of the script, outside of the while loop, we check to see if the approvalRequired variable is set to no. If it is, that means that our loop read every single resource, did its logic, and didn’t break out of the loop due to a resource requiring approval.
This is a happy result for us — we can safely determine that all resources have been cleared for automatic deployment, and trigger it.
# If all lines evaluated, and we still haven't decided to require approval, then all
# resources have been checked and none triggered approval flow
if [[ $approvalRequired == "no" ]]; then
  echo "****************************************"
  echo "##[section]Approval will not be required"
  echo "****************************************"
  echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]false"
  echo ""
  echo ""
fi
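To see how the in-loop break and the after-loop check combine, here is a stripped-down sketch of just the aggregation. It assumes the per-line decisions have already been made and arrive one per line on stdin; the function name aggregate_decisions is hypothetical, and it uses return where the real script uses break and exit so it can be called more than once.

```bash
#!/bin/bash
# Hypothetical helper: read one decision per line ("yes", "no", "notSure")
# on stdin and print the pipeline variable line for the whole batch.
aggregate_decisions() {
  local approvalRequired=""
  local decision
  while IFS= read -r decision; do
    approvalRequired="$decision"
    if [[ $approvalRequired == "yes" ]]; then
      # One resource needing approval is enough for the whole batch
      echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]true"
      return 0
    elif [[ $approvalRequired == "no" ]]; then
      continue
    else
      # notSure or anything unexpected: fail closed
      echo "##[error]Something has gone wrong, can't determine"
      return 1
    fi
  done
  # Loop finished without a "yes": every line was cleared
  if [[ $approvalRequired == "no" ]]; then
    echo "##vso[task.setvariable variable=approvalRequired;isOutput=true]false"
  fi
}

# A batch with one unsafe change, then a batch where everything is safe
printf 'no\nyes\nno\n' | aggregate_decisions
printf 'no\nno\n' | aggregate_decisions
```

The first call stops at the second line and asks for approval; the second reads every line and only then reports that approval is not required.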
Summary
This model gives us an increasing amount of control over which resource types and actions can be automatically approved by our CI/CD deployment logic. I’m certain this will be an ongoing effort as teams begin to understand what this can do for them, and the amount of time it can save.
Here’s the source code so you can go build it yourself!
KyMidd/AzureDevOps_Terraform_ResourceType_AutoApprovals (github.com)
Thanks everyone, and good luck out there.
kyler