Terraform AzureRM Provider Has a Breaking Bug, Azure and Hashi Won’t Fix
tl;dr: Azure API bug renders Terraform helpless to manage FrontDoor and several other Azure services. Both companies publicly say they’re working on it. Meanwhile, customers are stuck. Read on for more details.
Update: After much pressure, Hashi has rolled back their patch that more stringently enforced case and caused this issue to be exposed to users. As of AzureRM provider release 2.40, this issue should be fixed. If it’s not, please report back!
Update 2: Microsoft has now also released a patch to their API that ignores case on API requests. That missed the point a bit: the problem is the casing sent back to users, not the casing received. However, this issue is much less urgent, since Hashi has helpfully hidden it with the above change.
Hey all!
Normally the focus of my articles is on how to build something. I focus on how to combine different technologies, or how process and platform can do some great things for your team.
This one will be different — it’s about a sneaky bug we’ve found in Azure’s FrontDoor resource API, and how both Azure and Hashi are thus far refusing to budge in fixing it. They have vastly different reasons for not doing so.
Azure’s Perspective
Azure Cloud is built in an asymmetric way between the product and API groups. First, the product team creates…, well, they create products, obviously. Then as a second stage, the API team follows on and bootstraps APIs into these products for folks to manage them with AZ CLI or other services that consume APIs, which for many will be Terraform.
This is an especially unusual development pattern compared with AWS. In AWS, to my knowledge, product dev teams are also responsible for their API, meaning synchronous and more full-featured API development with the product.
In short, APIs are an afterthought at Azure
Because of Azure’s asymmetric development, it’s clear they deprioritized the API development, which puts products like Terraform at a disadvantage in supporting them. After all, if it works in the console Azure is happy.
And that’s so far Azure’s response to my requests — our APIs sometimes lag behind. Just wait.
Hashi’s Perspective
HashiCorp’s Terraform product utilizes platform APIs to provision and manage resources. It doesn’t interact with the web console like a human would to manage resources.
That puts them at a distinct disadvantage here. Their product is only as good as the platform API support is, and with Azure deprioritizing API development, they aren’t as effective at supporting Azure as they are for a platform like AWS.
Because of this cultural deprioritization, I wouldn’t expect Terraform (or any API-driven management tool) to become significantly more effective on Azure. Without cultural support at the target platform, how could it?
For this particular issue, Hashi claims (link) it has already made several technical workarounds for the unusual behavior of Azure APIs, particularly in the networking space. Some of those APIs change behavior based on JSON serialization, which directly contravenes the JSON RFC, among other oddities. Hashi argues that piling on more of these bandaids will eventually lead to unpredictable and nuanced failure scenarios that will be hard to root cause because of the internal patches.
I don’t want to hit this too hard, but with a single team managing the APIs for Azure tooling, why are the APIs so fragile and inconsistent? Surely centralizing expertise on how APIs should be written should strengthen and standardize API structure and syntax? But that’s not what we’re seeing.
Customer Perspective
Regardless of who you feel is right (Hashi’s right), it leaves customers in an unfortunate place — Terraform is unable to manage Azure FrontDoor, a critical piece of web server hosting infrastructure in Azure.
The advice I have from Microsoft is to just wait. And the advice I have from Hashi is… crickets. They are waiting for Microsoft to act. So we’re stuck.
Since Terraform (and this Azure provider layer) is open-source, the bug report is open source, and users have made all sorts of suggestions to get around it. Hashi staff has, for whatever reason, marked all mention of customer-side workarounds as off-topic, which stifles folks attempting to work around the issue.
This is a bad look for Hashi. User input on workarounds, especially for bugs that Hashi could fix but chooses not to, should not be suppressed.
Technical Deep-Dive
The bug here was first noticed on Terraform’s AzureRM release 2.24.0. Here’s the bug report, from August 22, almost 3 months ago:
Azurerm_frontdoor with v2.24.0 breaks when azure frontdoor is edited in portal. · Issue #8208 ·…
The gist is this: if Terraform uses an AzureRM provider of 2.24.x or newer, existing FrontDoor resources generate an error when Terraform refreshes their state. The error looks like this:
Error: flattening backend_pool: ID was missing the healthProbeSettings element
The root cause, identified in the bug report, is that Azure’s FrontDoor resource API returns inconsistent casing in resource ID strings. Azure’s own API guide (link) says that the casing of their API responses should match the casing of API requests. Hashi of course relies on this published API document being true, but here a request to:
.../providers/Microsoft.Network/frontdoors/...
Gets a response about resource (note the capital “D” in frontDoors):
.../providers/Microsoft.Network/frontDoors/...
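To make the failure mode concrete, here is a minimal Python sketch of how a strict, case-sensitive ID parser can trip over a casing difference. This is not the provider’s actual code (the provider is written in Go). The function names and both sample IDs are made up for illustration, and the lowercase `healthprobesettings` variant is an assumed example of a mismatch, not a value copied from the bug report.

```python
from __future__ import annotations

# Hypothetical IDs for illustration only; they are not taken from the bug report.
EXPECTED_ID = (
    "/subscriptions/0000/resourceGroups/example-rg/providers/"
    "Microsoft.Network/frontDoors/example-fd/healthProbeSettings/probe1"
)
# Assumed example of the same ID coming back with different casing.
RETURNED_ID = EXPECTED_ID.replace("healthProbeSettings", "healthprobesettings")


def parse_resource_id(resource_id: str) -> dict[str, str]:
    """Split an Azure-style resource ID into {segment_name: value} pairs."""
    parts = [p for p in resource_id.split("/") if p]
    if len(parts) % 2 != 0:
        raise ValueError(f"expected key/value pairs in {resource_id!r}")
    return dict(zip(parts[0::2], parts[1::2]))


def strict_lookup(resource_id: str, key: str) -> str:
    """Case-sensitive lookup: fails if the segment name's casing differs."""
    segments = parse_resource_id(resource_id)
    if key not in segments:
        # Message shaped after the error quoted in the article.
        raise ValueError(f"ID was missing the {key} element")
    return segments[key]


def tolerant_lookup(resource_id: str, key: str) -> str:
    """Case-insensitive lookup: compares segment names in lowercase."""
    segments = {k.lower(): v for k, v in parse_resource_id(resource_id).items()}
    if key.lower() not in segments:
        raise ValueError(f"ID was missing the {key} element")
    return segments[key.lower()]


if __name__ == "__main__":
    print(tolerant_lookup(RETURNED_ID, "healthProbeSettings"))
    try:
        strict_lookup(RETURNED_ID, "healthProbeSettings")
    except ValueError as err:
        print(f"Error: {err}")
```

The strict version is the more correct one if the API honors its own casing rules, which is why a stricter parser can be both right and breaking at the same time.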
Hashi can write logic around this on the AzureRM provider side that helps correct the casing of responses or requests, but that logic is exactly what they refer to in terms of a bandaid that might generate further issues downstream for other resources. Should their outputs or internal references use the request casing or the response casing?
The state file database Terraform keeps for resource management could quickly become a patchwork of bandaids as each layer attempts to match this one-off casing for only certain Azure resources.
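As a rough illustration of what such a bandaid might look like, here is a Python sketch that rewrites known segment names to one canonical casing. Everything in it is an assumption for illustration: the table, the function name, the sample IDs, and the choice of the response casing as canonical (the opposite choice would be just as defensible, which is exactly the open question).

```python
from __future__ import annotations

# Hypothetical table: every affected resource type needs its own entry,
# which is how the "patchwork" grows over time.
CANONICAL_SEGMENTS = {
    "frontdoors": "frontDoors",
    "healthprobesettings": "healthProbeSettings",
}


def normalize_id(resource_id: str) -> str:
    """Rewrite segment *names* (not values) to their canonical casing.

    Assumes the ID starts with "/" so that, after splitting, segment names
    sit at odd indexes and their values at even indexes.
    """
    parts = resource_id.split("/")
    for i in range(1, len(parts), 2):
        parts[i] = CANONICAL_SEGMENTS.get(parts[i].lower(), parts[i])
    return "/".join(parts)


if __name__ == "__main__":
    # Made-up ID for demonstration.
    print(normalize_id(
        "/subscriptions/0000/resourceGroups/example-rg/providers/"
        "Microsoft.Network/frontdoors/example-fd"
    ))
```

Note that only segment names are touched. A resource that a user happened to name `frontdoors` keeps its name as written, which is one of the subtle cases a naive string replacement would get wrong.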
But Why Now?
What’s interesting is that the Azure API didn’t change to trigger this problem. As far as we can tell, its casing has been wrong this entire time. Terraform was previously more forgiving about the inconsistent casing, proving that a Hashi-side change is possible.
The PR that introduced this interestingly correct yet breaking behavior is here:
frontdoor: refactoring & ensuring ID's are consistent by tombuildsstuff · Pull Request #8146 ·…
This change shipped in the weekly AzureRM release v2.24.0 on Aug 20, 2020.
Release v2.24.0 · terraform-providers/terraform-provider-azurerm
This PR specifically standardizes the formatting and nomenclature of FrontDoor API-provided resource references so they can be more easily used for other dependent resources without modification.
So Hashi implemented a higher validation standard than the Azure SDK team themselves has, leading to this breaking bug.
Workarounds
The workarounds aren’t great. The most promising one is to use a version of the AzureRM provider from before this PR was merged, v2.23.x.
However, v2.23 was released in mid-August, and there are many resource configurations and even some entire resources which are missing from it. If your team already uses those resources or attributes, you won’t be able to move to it. If you do successfully move back and then your team wants to use them, they will be blocked — terraform will error out because of the unrecognized attribute.
Despite the problem ostensibly being on the Azure side, the issue is experienced by a terraform command failing to run, which the teams I’ve worked with interpret as a problem with Terraform.
As with some other Terraform problems, you can also solve this with state file hacking. “Hacking” is a misnomer, and I use it less to indicate breaking in and more to indicate that these types of solutions are rough and prone to breaking. Take a backup of your state file before making any changes.
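For readers who want to see what this kind of edit involves, here is a minimal Python sketch, assuming a local JSON state file. It copies the file aside first, then rewrites matching substrings in every string value. The function names, the backup suffix, and the replacement mapping are all hypothetical, and this is not the procedure from the GitHub comment discussed next.

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path


def backup_state(path: Path) -> Path:
    """Copy the state file next to itself before touching it."""
    backup = path.with_name(path.name + ".before-edit")  # hypothetical suffix
    shutil.copy2(path, backup)
    return backup


def rewrite_strings(node, replacements: dict[str, str]):
    """Return a copy of a JSON structure with substrings replaced in values."""
    if isinstance(node, dict):
        return {k: rewrite_strings(v, replacements) for k, v in node.items()}
    if isinstance(node, list):
        return [rewrite_strings(v, replacements) for v in node]
    if isinstance(node, str):
        for old, new in replacements.items():
            node = node.replace(old, new)
    return node


def patch_state(path: Path, replacements: dict[str, str]) -> bool:
    """Back up, rewrite, and save the state file. Returns True if it changed."""
    backup_state(path)
    state = json.loads(path.read_text())
    patched = rewrite_strings(state, replacements)
    if patched == state:
        return False
    path.write_text(json.dumps(patched, indent=2))
    return True


if __name__ == "__main__":
    # Assumed mapping and file name; adjust both to what your state contains.
    changed = patch_state(Path("terraform.tfstate"), {"/frontdoors/": "/frontDoors/"})
    print("state changed" if changed else "nothing to change")
```

A blunt substring replacement like this has no idea which strings are resource IDs, so review the diff against the backup before running Terraform again.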
This GitHub comment (Hidden by Hashi for being off-topic?!) from cpressland endeavors to fix the issue in their individual terraform state.
Azurerm_frontdoor with v2.24.0 breaks when azure frontdoor is edited in portal. · Issue #8208 ·…
They find that several resource types are seeing this inconsistent behavior, and fix them, but Terraform notices the updates and gives them a pretty scary error message:
Error: provider produced inconsistent final plan
(removed)
This is a bug in the provider, which should be reported in the provider's own issue tracker.
They run terraform a few times, and this issue sorts out somehow, but it’s unclear how, why, or if this is a repeatable fix.
Personally, I wouldn’t advise doing this.
Even if this fix is perfect, you’ll need to do this for all resources built with these bad APIs every time they’re built, in all environments, across all state files.
If a team member rebuilds an environment, it will break until you manually fix it. And again, that’s only if this fix is reliable, which isn’t yet proven.
Summary
I wish I had better news here. It has been nearly 3 months, and neither company has budged. I am escalating as much as I can with both, and no movement so far.
I’m sure far louder and more informed voices than mine have called out this issue as a problem for their teams, but I’ll add my voice to theirs.
Hashi and Azure, please fix this issue for your users! We depend a great deal on both of your technologies to do our jobs and accomplish our goals. The health care services I help facilitate at my company are directly impacted and harmed by this standoff, and I ask that it please, please be handled soon.
Thanks all. Good luck out there.
kyler