🔥Let’s Do DevOps: Terraform Drift Detection using GitHub Native Tools🚀
And how to post the drift to a slack room with links
This blog series focuses on presenting complex DevOps projects as simple and approachable via plain language and lots of pictures. You can do it!
Hey all!
Terraform can experience drift when you deploy a resource using terraform, and then someone changes that resource manually in the cloud. Usually you discover the drift the next time you run terraform, when it notices the differences and fixes them by updating the resources.
Hashi’s Cloud Platform (HCP) offers terraform drift detection as part of their (quite expensive) offering (more info here).
So when I was asked to evaluate how we’d implement drift detection in our environment, I was immediately interested.
In the end, I was able to implement drift detection in a few hours in maybe 25-ish lines of bash, which I think is pretty cool. It’s certainly very useful for enterprises to know when their terraform states are showing drift.
This is all implemented in native GitHub Actions with a Slack webhook to send notifications. In the end our slack notifications look like this:
If you only care about the code, you can scroll to the bottom for a link to the GitHub Repo.
Let’s talk about how I built it!
Architecture Summary
We have a lot of terraform repos that are all doing different things, but we decided long ago that we’d have just a single (well, a few) copies of the Actions we use in a centralized Repo. We have these workflows:
Terraform Validate - Run at Pull Request time, builds the terraform plan
Terraform Plan - Run at Deploy time, builds a plan and caches the file
Terraform Apply - Run at Deploy time once approved, applies the terraform plan file
In every repo where terraform configuration lives, we have some Actions that call these workflows:
Terraform Validate - Runs at pull request time, calls Terraform Validate workflow. Builds every terraform permutation in parallel using a matrix syntax.
Terraform Deploy - Manually triggered. Runs a targeted (single) terraform workspace, caches the plan file, waits for human approval (sometimes just in upper environments), and deploys the plan file once approved.
Terraform Batch - Manually triggered. Uses regex to select some number of the possible terraform workspaces to target, and then runs the Terraform Deploy action in parallel using a matrix construct. More information on that here:
The question right now is - where should we make this change? Should it be an entirely new Action or workflow? In the end, I decided to update the Terraform Validate Action - after all, it already builds all our workspaces (I want to check every workspace for drift), and it’s a good fit for updates since it’s so simple.
Let’s go over what we changed, but let’s work backwards:
How to add a webhook in slack (and add sweet custom picture and name)
What changed in the Terraform Validate workflow
How we’re calling it from the Action
First, the Slack Webhook.
Adding a Slack Webhook
Webhooks are so powerful for automation - as I learn more about integrating different systems together, I find myself relying on webhooks almost as much as APIs.
Slack is no different - it supports webhooks that let you post to slack channels with an HTTP POST payload. They’re called Slack Incoming Webhooks, more info here.
First, let’s head over to our Slack client and find the “Customize workspace” link. This shows us all the integrations (as well as all those sweet, sweet emojis).
Next, find the Custom Integrations on the left side.
Under Custom Integrations, you’ll find Incoming WebHooks. Click here to load up this Custom Integration.
Click on the Configuration tab - this will show all the Slack Webhooks that exist in your environment. But we want to add a new one - over on the left click on Request Configuration.
Once a Slack Workspace Admin approves, you’ll get a message like this that links you to the configuration page for your new WebHook!
There’ll be a lot of setup information, but you can scroll to the bottom to get to the good stuff. First, what room do you want to post in? You can request any.
And second, the webhook URL is listed. If this leaks somehow you can click “Regenerate” here and it’ll generate a new URL and the old link will be dead. Note that there is no authentication (at all!) required to use this webhook, so don’t share this URL unless you want to be on the receiving end of random (maybe malicious) spam.
You’ll also see some less-important, but way more fun options below - the name that the webhook will use when it posts to slack, and the Icon it’ll use as the “Profile Pic” for the post.
Hit save and you’re in business. You can test the webhook with a simple curl.
> curl -sX POST --data-urlencode "payload={\"text\": \"Hello world.\"}" https://hooks.slack.com/services/xxxxx
ok
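Slack answers an accepted post with a plain `ok` body, so a script can check for that instead of eyeballing the terminal. Here's a minimal sketch of that check. `post_to_slack` is a hypothetical helper name (it isn't part of the Action), and it assumes the message text contains no double quotes or backslashes.

```shell
# Hypothetical helper: post a message to a Slack incoming webhook and
# succeed only if Slack answers with the body "ok".
# Usage: post_to_slack <webhook_url> <message_text>
post_to_slack() {
  webhook_url="$1"
  message_text="$2"

  # Same request as the manual curl test; capture the response body.
  response=$(curl -sX POST \
    --data-urlencode "payload={\"text\": \"${message_text}\"}" \
    "$webhook_url")

  if [ "$response" = "ok" ]; then
    echo "Slack accepted the message"
  else
    echo "Slack rejected the message: ${response}" >&2
    return 1
  fi
}
```

You'd call it as `post_to_slack "https://hooks.slack.com/services/xxxxx" "Hello world."` and branch on the return code.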
Okay, so we now have a method to post data to Slack. Now we need to identify if we should post data to slack. So let’s modify our TF Validate workfile to inspect our TF Plan to see if there are any changes.
Adding Drift Detection to TF Validate
First of all, our TF Validate pipeline just generates a plan, and doesn’t save it. There’s no need to, after all - we just want to print it to stdout, so it’ll print the output in the GitHub pipeline.
The first step to reading the plan is to save it somewhere, so let’s modify our `terraform plan` command to output the plan as a file. On line 5, we added a new argument to output the terraform plan as a binary file, “tf.plan”.
terraform plan \
  -input=false \
  -lock=false \
  -var-file="data/${{ inputs.solution_name }}.tfvars" \
  -out tf.plan
Next, let’s add a new step to our workfile called “Terraform Drift Detection”. We assign an id, specify the shell (bash), and print out what we’re doing.
Printing out what step we’re in is more important than it seems - since this is a workfile, all the output is combined into one step from the calling Action’s run screen. Printing headers for each step helps clarify where we’re at, and more easily diagnose what failed.
- name: Terraform Drift Detection
  id: tf-drift-detection
  shell: bash
  run: |
    echo ""
    echo "########################"
    echo "## Terraform Drift Detection"
    echo "########################"
    echo ""
Next up, we use “terraform show” to read the tf.plan file, and filter it for a few strings - either we’ll find “Your infrastructure matches the configuration” (no drift) or “Terraform will perform the following actions” (drift). We could save the entire change plan to a variable, but for large change sets, that can get unwieldy, and we don’t need that info anyway, so let’s only keep what we need.
We wrap the entire thing on line 2 in a “$()” which will store the response in a variable. That means we can look at $TERRAFORM_PLAN_OUTPUT for future tests.
Which is exactly what we’re doing. On line 5, we check to see if the string “error” is present. If it is, we weren’t able to read the plan for some reason. We print an output for future troubleshooting.
# Read the tf.plan file, filter for change line
TERRAFORM_PLAN_OUTPUT=$(terraform show -no-color tf.plan | grep -E 'Your infrastructure matches the configuration|Terraform will perform the following actions' || echo "error")

# Test if read successfully
if [[ "$TERRAFORM_PLAN_OUTPUT" == "error" ]]; then
  echo "There was an issue with reading the terraform plan file, continuing to be safe"
else
  echo "Terraform plan output file retrieved successfully, checking for changes"
fi
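A side note: matching on strings depends on terraform keeping its wording stable. As an alternative sketch (this is not what the workflow here does), `terraform plan` has a `-detailed-exitcode` flag that returns 0 for no changes, 1 for an error, and 2 when changes are present. The function name below is hypothetical.

```shell
# Map the exit code of `terraform plan -detailed-exitcode` to a drift state.
#   0 = plan succeeded, no changes
#   1 = plan failed
#   2 = plan succeeded, changes present (drift, on a scheduled run)
plan_state_from_exit_code() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Usage inside a workflow step (not run here). The `|| plan_exit=$?` form
# keeps a non-zero exit code from aborting the step under bash's -e option:
#   plan_exit=0
#   terraform plan -input=false -lock=false -detailed-exitcode -out tf.plan || plan_exit=$?
#   plan_state_from_exit_code "$plan_exit"
```

The trade-off is that the exit code has to be captured at plan time, while the string filter works on any saved plan file after the fact.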
Okay, here’s the real meat! This is big, so let’s break it down.
On line 1, we check if the terraform plan var says our config matches the infra. This means no drift, so we print a line and continue. No need to notify slack that there’s no drift (although this is something I think would be fun to implement in the future).
On line 3, we check if there’s any drift - terraform printing a “will perform the following actions” line. If yes, that means there’s drift, and we need to do a bunch of stuff.
Lines 7-8 create some vars that we’ll use in the links we’ll generate - the http link to the repo, and to the run of this action. We want folks to be able to one-click from the slack notifications right to the Action run.
On line 11, we have a curl. Here’s the entire thing so we can dissect it:
curl -sX POST --data-urlencode "payload={\"text\": \"Drift has been detected in <$REPO_URL|${{ github.event.repository.name }}> in env <${ACTION_RUN_URL}|${{ inputs.solution_name }}>.\"}" ${WEBHOOK_URL}
curl - *nix tool for sending http requests
-sX POST - send a post, and do it “s”ilently, by not printing status of send (we don’t care)
--data-urlencode - URL-encode this payload before sending
Here’s the full conditional, with the curl on line 11:
if [[ "$TERRAFORM_PLAN_OUTPUT" =~ "infrastructure matches the configuration" ]]; then
  echo "There are no changes for terraform to apply."
elif [[ "$TERRAFORM_PLAN_OUTPUT" =~ "will perform the following actions" ]]; then
  echo "Drift has been detected, posting notification to Slack room"

  # Set vars
  REPO_URL=${{ github.server_url }}/${{ github.repository }}
  ACTION_RUN_URL=${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}

  # Send drift detection post to slack
  curl -sX POST --data-urlencode "payload={\"text\": \"Drift has been detected in <$REPO_URL|${{ github.event.repository.name }}> in env <${ACTION_RUN_URL}|${{ inputs.solution_name }}>.\"}" ${{ inputs.prod_drift_slack_webhook_url }}
else
  echo "There was an error detecting if there were changes"
  echo "We should investigate what's wrong, or try re-running this workflow"
fi
The payload is an escaped version of this. We’re sending:
Drift has been detected in
(Repo Name) - link to the Repo on GitHub
in env
(solution name, like dev-eastus1-01) - link to the Action run that found drift on GitHub
payload={"text": "Drift has been detected in <$REPO_URL|${{ github.event.repository.name }}> in env <${ACTION_RUN_URL}|${{ inputs.solution_name }}>."}
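Hand-escaping all those quotes gets fiddly, so here's a rough sketch of building the same payload with small helpers. The function names and the example URLs are made up for illustration, and the escaping only covers backslashes and double quotes, not newlines or control characters.

```shell
# Escape backslashes and double quotes so a string is safe inside JSON quotes.
# (Newlines and control characters are not handled in this sketch.)
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build Slack link markup: <url|label>
slack_link() {
  printf '<%s|%s>' "$1" "$2"
}

# Assemble the JSON body for the drift message.
# Args: repo_url repo_name run_url solution_name
drift_payload() {
  text="Drift has been detected in $(slack_link "$1" "$2") in env $(slack_link "$3" "$4")."
  printf '{"text": "%s"}' "$(json_escape "$text")"
}

# Example call with placeholder values
drift_payload \
  "https://github.com/example-org/example-repo" "example-repo" \
  "https://github.com/example-org/example-repo/actions/runs/1" "dev-eastus1-01"
```

The result can be handed to curl as `--data-urlencode "payload=$(drift_payload ...)"`, which keeps the quoting in one place.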
We also need the Slack webhook URL, so we add some inputs at the top. By setting this as required, but with a default value, we are backwards compatible with anyone who hasn’t yet updated their module. Their slack webhooks will just go to google.com.
drift_slack_webhook_url:
  description: Slack webhook URL for posting drift detection messages
  required: true
  default: www.google.com
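If you'd rather not send a POST to the placeholder default at all, one option is to check the URL first. This is only a sketch: both function names are hypothetical, and it assumes real webhook URLs start with `https://hooks.slack.com/services/`, like the one we tested earlier.

```shell
# Return success only if the value looks like a Slack incoming-webhook URL.
is_slack_webhook() {
  case "$1" in
    https://hooks.slack.com/services/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical guard around the notification step.
notify_if_configured() {
  if is_slack_webhook "$1"; then
    echo "posting drift notification"
  else
    echo "no Slack webhook configured, skipping notification"
  fi
}

# With the default input value, the post is skipped
notify_if_configured "www.google.com"
```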
Don’t Detect Drift Within PRs
Now we’re in kind of a weird state. When the TF Validate pipeline runs within a PR context, it’ll post to our slack notification channel when there’s any difference between our code and the real env, which is… probably most of the time, right?
So that’s not great. We want a way to both automatically run this action, and to avoid any clashes with the actual PR use case.
The solution is scheduling! We can set our action to run automatically via a cron schedule, and we set our workfile task to only trigger when it’s run within that context.
Let’s update our workfile task first, to read the “github.event_name” that’s triggering the workfile. If it’s triggered via a PR or manually, this task won’t even run. If the Action is triggered via a cron schedule, this task will run. That’s exactly what we need.
To do that in practice we add an “if” on line 4 that says only run this task IF the github.event_name is exactly “schedule”.
- name: Terraform Drift Detection
  id: tf-drift-detection
  # If we're a scheduled run, we're in drift detection mode. Check for drift and post to slack if detected
  if: github.event_name == 'schedule'
  shell: bash
  run: |
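The same gate can also be expressed inside the script, which is handy for testing the logic locally. A minimal sketch, assuming the `GITHUB_EVENT_NAME` variable that Actions sets on its runners; the function name is hypothetical:

```shell
# Decide whether to run drift detection based on the triggering event.
# Uses the first argument if given, else the GITHUB_EVENT_NAME variable.
should_check_drift() {
  event_name="${1:-${GITHUB_EVENT_NAME:-}}"
  [ "$event_name" = "schedule" ]
}

# Walk through the three trigger types this Action sees
for event in schedule pull_request workflow_dispatch; do
  if should_check_drift "$event"; then
    echo "$event: run drift detection"
  else
    echo "$event: skip drift detection"
  fi
done
```

Only the `schedule` event reports that drift detection should run. The step-level `if` is still the cleaner choice in the workflow, since a skipped step shows up as skipped in the run screen.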
Do Drift Detection Nightly
You don’t actually have to do drift detection nightly. You could run it every 15 minutes if you want, but it will post to slack every time it finds that the state has drifted from the environment, and that could be annoying. I think daily is enough to be informative but not enough to be annoying.
Thankfully, GitHub Actions support a “schedule” trigger using cron syntax.
On line 6 we are setting a “schedule” trigger, and we’re specifying one schedule on line 7, cron of “0 5 * * *” which means every night at 5a UTC, which is 11p my local time, Central USA time. This will automatically trigger the action to run against the default branch without any intervention like a git merge or webhook or button press.
name: Terraform Validation

on:
  pull_request:
  # Run nightly at 5a UTC / 11p CT
  schedule:
    - cron: "0 5 * * *"
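One thing to keep in mind: the schedule is evaluated in UTC, so the local run time shifts by an hour when daylight saving time kicks in. Here's a quick sketch of the arithmetic, with a hypothetical helper name and fixed offsets of -6 (CST) and -5 (CDT):

```shell
# Convert an hour of the day in UTC to local time for a fixed UTC offset.
# Args: utc_hour offset_hours (e.g. -6 for CST, -5 for CDT)
utc_hour_to_local() {
  echo $(( ($1 + $2 + 24) % 24 ))
}

echo "CST: $(utc_hour_to_local 5 -6):00"   # standard time  -> CST: 23:00
echo "CDT: $(utc_hour_to_local 5 -5):00"   # daylight time  -> CDT: 0:00
```

So the "11p" run lands at midnight Central for part of the year, which is fine for a nightly check.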
We also update the Action to send the additional info the Task needs to run: on the last line, we pass the slack webhook URL to our Action.
- name: Terraform Validate
  uses: kymidd/azure-terraform-validate-action@master
  with:
    SSH_KEY: ${{ secrets.SSH_KEY }}
    location: ${{ env.location }}
    solution_name: ${{ env.solution_name }}
    terraform_version: ${{ env.tf_version }}
    az_tenant_id: ${{ env.az_tenant_id }}
    az_client_id: ${{ env.az_client_id }}
    az_subscription_id: ${{ env.az_subscription_id }}
    tf_storage_resource_group_name: ${{ env.tf_storage_resource_group_name }}
    tf_storage_account_name: ${{ env.tf_storage_account_name }}
    tf_storage_container_name: ${{ env.tf_storage_container_name }}
    tf_state_filename: ${{ env.tf_state_filename }}
    drift_slack_webhook_url: "https://hooks.slack.com/services/xxxxx/yyyyy"
Summary
You can find the new version of Terraform Validate published:
The source is here on GitHub.com/KyMidd/azure-terraform-validate-action
For anyone referencing the v1 version of the Action, the change is live NOW.
In this walk-through we talked about what Terraform drift detection is and why it’s useful for a business, then walked through how to set up drift detection and how to notify slack via a webhook that something has happened (as well as embedding links right back to the action).
All told, this is a pretty cool project. Have you managed to implement drift detection in your own networks? If yes, please tell us how in the comments!
Thanks all. Good luck out there.
kyler