Terraform state: The good, the bad, and the ugly
A dive into Terraform state management—why it exists, common pitfalls teams face, and the security disasters waiting to happen if you're not careful

If you've worked with Terraform for more than a day, you've encountered the state file. Maybe you've cursed at it when it locked during a critical deployment. Maybe you've panicked when someone accidentally deleted it. Or maybe—just maybe—you've committed one with Azure credentials baked in and had a very awkward conversation with your security team.
Terraform state is simultaneously one of the most elegant solutions and one of the biggest operational headaches in infrastructure-as-code. Let's break down why it exists, where it hurts, and how it can spectacularly blow up in your face.
THE GOOD: WHY STATE EXISTS IN THE FIRST PLACE
Terraform's state file is basically a snapshot of your infrastructure at a given point in time. It maps your HCL configuration to real-world resources and stores metadata that Terraform needs to manage those resources effectively.
It solves the "what's actually deployed" problem
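Without state, Terraform would have no record of which real resources belong to which blocks in your configuration. It would have to rediscover them from your cloud provider on every run, and many resources carry no reliable marker to match on. That's unworkable once you're managing 500+ resources across multiple regions. State stores that mapping, along with the last known attributes of each resource. By default terraform plan still refreshes each resource against the provider, but on large configurations you can pass -refresh=false and plan from the cached attributes alone.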
# Your config says this:
resource "azurerm_linux_virtual_machine" "web" {
  name                = "web-vm"
  resource_group_name = azurerm_resource_group.main.name
  location            = "East US"
  size                = "Standard_B1s"
  # ... other config
}
# State file knows this exists with its Azure resource ID
# and can detect when you change size or other attributes
It tracks resource dependencies
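Terraform builds its dependency graph from your configuration and records each resource's dependencies in state. That record is what lets it handle resources you've already removed from the config: it still knows that your VM depends on its network interface, which depends on a subnet, which depends on your VNet. When you destroy resources, it tears them down in the correct order. Try doing that manually at 2 AM during an incident. I'll wait.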
It enables collaboration (sort of)
When multiple engineers work on the same infrastructure, state acts as the source of truth. Combined with remote backends and locking, it prevents the classic "we both applied changes at the same time and now everything's broken" scenario.
The state file also stores output values, which other Terraform configurations can reference:
# In your networking configuration (root module outputs only)
output "subnet_id" {
  value = azurerm_subnet.main.id
}
# Another team's config can use this
data "terraform_remote_state" "network" {
  backend = "azurerm"
  config = {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "companytfstate"
    container_name       = "tfstate"
    key                  = "network.terraform.tfstate"
  }
}
resource "azurerm_network_interface" "app" {
  # ... other config
  ip_configuration {
    name                          = "internal"
    # Reference the subnet from remote state
    subnet_id                     = data.terraform_remote_state.network.outputs.subnet_id
    private_ip_address_allocation = "Dynamic"
  }
}
This is actually pretty slick when it works. The problem is all the ways it doesn't work.
THE BAD: COMMON PAIN POINTS THAT'LL MAKE YOU QUESTION YOUR LIFE CHOICES
State locking conflicts
You know that feeling when you run terraform apply and get hit with "Error acquiring the state lock"? That's state locking doing its job—preventing concurrent modifications. But when Jenkins crashes mid-apply or someone's laptop dies, that lock stays in place.
# The dreaded error
Error: Error acquiring the state lock
Error message: blob is already locked
Lock Info:
ID: a1b2c3d4-5678-90ef-ghij-klmnopqrstuv
Path: companytfstate/tfstate/prod.terraform.tfstate
Operation: OperationTypeApply
Who: jenkins@ci-runner-42
Created: 2026-02-04 14:23:17.123456789 +0000 UTC
Now you're stuck deciding: is that lock legitimate, or is it a zombie from a failed run? Force-unlock and you might corrupt state if someone's actually using it. Don't unlock and your deployment is blocked.
# The nuclear option
terraform force-unlock a1b2c3d4-5678-90ef-ghij-klmnopqrstuv
# Proceed with extreme caution
State drift: When reality diverges from code
Someone made a "quick fix" in the Azure Portal. Someone else manually scaled a VMSS. Now your state file thinks you have 3 instances, but you actually have 5, and Terraform won't notice until its next refresh.
# A refresh-only plan shows the horror (abridged, illustrative output)
$ terraform plan -refresh-only
Note: Objects have changed outside of Terraform
  # azurerm_linux_virtual_machine_scale_set.web has changed
  ~ resource "azurerm_linux_virtual_machine_scale_set" "web" {
      ~ instances = 3 -> 5
        # ...
    }
# Now what? Scale back down? Update the config? Cry?
Drift detection with terraform plan -refresh-only only covers resources Terraform already tracks in state. If someone created resources outside of Terraform, no refresh will ever show them, and you're blind to them until they cause an outage.
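Not all drift is a mistake. If an attribute is expected to change outside Terraform, for example an instance count managed by autoscaling, one option is to tell Terraform to stop tracking it. A minimal sketch, assuming the scale set from the scenario above is declared at the hypothetical address azurerm_linux_virtual_machine_scale_set.web, with most required arguments omitted:

```hcl
resource "azurerm_linux_virtual_machine_scale_set" "web" {
  name      = "web-vmss" # hypothetical name
  instances = 3          # initial count only; later changes are ignored

  # ... other required arguments omitted for brevity

  lifecycle {
    # Stop reporting (and reverting) instance-count changes made by
    # autoscale rules or by hand in the portal.
    ignore_changes = [instances]
  }
}
```

This only hides drift you've decided to tolerate. It does nothing for resources created entirely outside Terraform.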
Team collaboration nightmares
Even with remote state and locking, you'll hit these scenarios:
Scenario 1: The race condition
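Alice runs terraform plan at 3:00 PM and reviews the output
Bob runs terraform apply at 3:05 PM
Alice runs terraform apply at 3:10 PM, trusting her 3:00 PM review
Her apply re-plans against state Bob has already changed, so it can undo his work or do something she never reviewed (a plan saved with -out is rejected as stale instead, which is the safer failure)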
Scenario 2: The branch problem
Feature branches create competing versions of state
Merging code is easy; merging state files is... not a thing
You end up with orphaned resources nobody remembers creating
Scenario 3: The "who changed what" mystery
State files don't have audit logs. Someone deleted your load balancer and you have no idea who or when. Good luck with that post-mortem.
Performance degradation at scale
Once you hit thousands of resources, terraform plan slows to a crawl, because by default every plan refreshes every resource in state against the provider API. You'll start breaking up monolithic state files into smaller, independently applied configurations just to keep things manageable:
# This used to take 10 seconds, now takes 5 minutes
$ terraform plan
Refreshing Terraform state in-memory prior to plan...
azurerm_virtual_network.main: Refreshing state... [id=/subscriptions/.../virtualNetworks/main-vnet]
azurerm_subnet.public[0]: Refreshing state... [id=/subscriptions/.../subnets/public-subnet-0]
# ... 2,437 more resources to go
THE UGLY: SECURITY DISASTERS AND STATE CORRUPTION
This is where things get scary. State files are ticking time bombs if you don't handle them properly.
Sensitive data exposure
Here's the thing nobody tells you: Terraform state files store everything in plaintext. Passwords, API keys, database connection strings—all sitting there unencrypted.
{
  "version": 4,
  "terraform_version": "1.6.0",
  "resources": [
    {
      "type": "azurerm_mssql_server",
      "name": "main",
      "instances": [
        {
          "attributes": {
            "administrator_login": "sqladmin",
            "administrator_login_password": "SuperSecretPassword123!",
            "fully_qualified_domain_name": "mydb.database.windows.net"
          }
        }
      ]
    }
  ]
}
If you commit this to Git (don't laugh, it happens constantly), you've just handed attackers your database credentials. Even if you remove it in the next commit, it's in Git history forever.
If your Azure Storage Account permissions are misconfigured, anyone with read access can download your state blob and extract secrets. Hosted backends have the same problem in a different shape: on a Terraform Cloud plan without team-level permissions, assume every member of your organization may be able to read state.
State corruption and the recovery nightmare
State corruption usually happens when:
A Terraform run gets interrupted mid-apply
Two applies run simultaneously (locking failed)
Manual edits go wrong
The state backend has issues (Azure Blob Storage transient errors, network timeouts)
When state corrupts, or was simply written by a newer Terraform than the one you're running, you're in for a bad time:
$ terraform plan
Error: Failed to load state: state snapshot was created by Terraform v1.6.0,
which is newer than current v1.5.0; upgrade to Terraform v1.6.0 or greater to work with this state
# Or worse...
Error: Failed to decode state: invalid character 'x' looking for beginning of value
Now you're trying to restore from backups (you have backups, right?), manually editing JSON to fix corruption, or using terraform import to rebuild state from scratch. For every. Single. Resource.
# The tedious recovery process
$ terraform import azurerm_linux_virtual_machine.web /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/web-vm
$ terraform import azurerm_network_security_group.web /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Network/networkSecurityGroups/web-nsg
# ... repeat 500 more times
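On Terraform 1.5 or later, import blocks take some of the pain out of this: you declare the mappings in configuration and let an ordinary plan and apply perform the imports, so the whole recovery can be reviewed in a pull request. A sketch using the same two resource IDs as above. It assumes the matching resource blocks already exist in your configuration; if they don't, terraform plan -generate-config-out=generated.tf can draft them (the file name is arbitrary):

```hcl
# Each block maps one existing Azure resource to a resource address.
import {
  to = azurerm_linux_virtual_machine.web
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/web-vm"
}

import {
  to = azurerm_network_security_group.web
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Network/networkSecurityGroups/web-nsg"
}
```

You still need one block per resource, but they can be generated by a script and applied in a single run instead of 500 separate commands.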
The "someone deleted production" disaster
The ultimate nightmare: someone runs terraform destroy on production state by accident. Or they delete the Azure Storage Account container holding state. Or they push broken state that marks everything for deletion.
# This should require two-factor authentication and a blood oath
$ terraform destroy
# ...
Destroy complete! Resources: 437 destroyed.
# Congratulations, you've just deleted production
Without backups and versioning enabled on your state backend, recovery is somewhere between "extremely difficult" and "update your resume."
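Two guardrails make this kind of accident harder to pull off. The sketch below is illustrative, not a complete configuration: the resource addresses and the lock name are hypothetical, and required arguments are omitted.

```hcl
# Guardrail 1: Terraform rejects any plan that would destroy this resource,
# including a full terraform destroy, until the setting is removed.
resource "azurerm_mssql_server" "main" {
  # ... required arguments omitted for brevity

  lifecycle {
    prevent_destroy = true
  }
}

# Guardrail 2: an Azure-side lock that blocks deletion of the storage
# account holding state, no matter which tool issues the delete.
resource "azurerm_management_lock" "tfstate" {
  name       = "tfstate-cannot-delete"            # hypothetical name
  scope      = azurerm_storage_account.tfstate.id # hypothetical resource address
  lock_level = "CanNotDelete"
  notes      = "Holds Terraform state; remove this lock deliberately before deleting."
}
```

Neither is bulletproof. prevent_destroy disappears if someone deletes the resource block, and a management lock covers control-plane operations, not data-plane deletes of individual blobs. Versioning and soft delete on the backend cover that gap.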
BEST PRACTICES: MAKING PEACE WITH STATE
Alright, enough doom and gloom. Here's how to avoid these disasters.
Always use remote backends
Never, ever keep state files local. Use Azure Storage, Terraform Cloud, or another remote backend with versioning and encryption.
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "companytfstate"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
    # State locking is built-in with blob leases
    # No separate lock table needed like DynamoDB
  }
}
Make sure your Azure Storage Account has:
Versioning enabled (recover from accidental deletes/corruption)
Encryption at rest (enabled by default with Microsoft-managed keys or use CMK)
Strict RBAC policies (principle of least privilege)
Soft delete enabled (provides undelete capability)
Diagnostic logging enabled (audit trail)
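Most of that checklist can itself be expressed in Terraform, usually in a small bootstrap configuration kept separate from the stacks that use the backend. A minimal sketch reusing the names from the backend block above. The location, replication type, retention periods, and the pipeline_principal_id variable are assumptions; argument names shift between azurerm provider versions (newer releases prefer storage_account_id on the container), and diagnostic logging is left out for brevity.

```hcl
variable "pipeline_principal_id" {
  type        = string
  description = "Object ID of the identity that runs Terraform (hypothetical)"
}

resource "azurerm_storage_account" "tfstate" {
  name                     = "companytfstate"
  resource_group_name      = "terraform-state-rg"
  location                 = "East US"
  account_tier             = "Standard"
  account_replication_type = "GRS"
  min_tls_version          = "TLS1_2"

  # Encryption at rest with Microsoft-managed keys is on by default.

  blob_properties {
    versioning_enabled = true # keep prior versions of the state blob

    delete_retention_policy {
      days = 30 # soft delete for blobs
    }

    container_delete_retention_policy {
      days = 30 # soft delete for containers
    }
  }
}

resource "azurerm_storage_container" "tfstate" {
  name                  = "tfstate"
  storage_account_name  = azurerm_storage_account.tfstate.name
  container_access_type = "private"
}

# Least privilege: only the pipeline identity gets blob read/write access.
resource "azurerm_role_assignment" "pipeline" {
  scope                = azurerm_storage_account.tfstate.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.pipeline_principal_id
}
```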
State locking with Azure blob leases
State locking prevents concurrent modifications. The good news? Azure Storage handles this automatically using blob leases—no separate infrastructure required.
When Terraform acquires a lock, it takes a lease on the state blob. If another process tries to modify the same state, it fails immediately unless you pass -lock-timeout to make it wait and retry:
# Azure handles locking automatically via blob leases
# No additional resources to provision
# The lease does not expire on its own: it is held until Terraform releases it,
# which is why a crashed run leaves the lock in place
You can verify locking behavior by checking blob properties in the Azure Portal or via CLI:
az storage blob show \
--account-name companytfstate \
--container-name tfstate \
--name prod.terraform.tfstate \
--query "properties.lease" -o table
The lease status will show "locked" when Terraform is actively modifying state.
Never commit secrets to Terraform config
Use secret management tools and reference them dynamically:
# BAD: Hardcoded secrets
resource "azurerm_mssql_server" "main" {
administrator_login_password = "SuperSecretPassword123!" # Don't do this
}
# GOOD: Reference from Azure Key Vault
data "azurerm_key_vault" "main" {
name = "company-keyvault"
resource_group_name = "shared-services-rg"
}
data "azurerm_key_vault_secret" "db_password" {
name = "sql-admin-password"
key_vault_id = data.azurerm_key_vault.main.id
}
resource "azurerm_mssql_server" "main" {
administrator_login_password = data.azurerm_key_vault_secret.db_password.value
}
The secret still ends up in state, but at least it's not in your Git repo. For extra paranoia, use customer-managed keys (CMK) in Azure Key Vault to encrypt state and configure RBAC to restrict access.
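A complementary habit is marking values as sensitive so they stay out of plan output and CI logs. A short sketch with hypothetical variable and output names. This only redacts what Terraform prints; the value is still written to state in plaintext.

```hcl
variable "db_password" {
  type      = string
  sensitive = true # shown as (sensitive value) in plan and apply output
}

output "db_password" {
  value     = var.db_password
  sensitive = true # required when an output is derived from a sensitive value
}
```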
Master state manipulation commands
Sometimes you need to manually fix state. Learn these commands:
# Import existing resources into state
terraform import azurerm_linux_virtual_machine.example /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/example-vm
# Remove a resource from state (without deleting the actual resource)
terraform state rm azurerm_linux_virtual_machine.old_server
# Move a resource to a different address
terraform state mv azurerm_linux_virtual_machine.old azurerm_linux_virtual_machine.new
# List all resources in state
terraform state list
# Show details about a specific resource
terraform state show azurerm_linux_virtual_machine.web
Use these carefully—they directly modify state. Always back up state before manual operations:
terraform state pull > backup.tfstate
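For renames and removals there are also declarative equivalents that go through the normal plan and apply cycle, so the change gets reviewed like any other code. A sketch using the same resource addresses as the commands above; moved blocks need Terraform 1.1 or later, and removed blocks need 1.7 or later.

```hcl
# Replaces: terraform state mv azurerm_linux_virtual_machine.old azurerm_linux_virtual_machine.new
# (rename the resource block itself to "new" in the same change)
moved {
  from = azurerm_linux_virtual_machine.old
  to   = azurerm_linux_virtual_machine.new
}

# Replaces: terraform state rm azurerm_linux_virtual_machine.old_server
# (delete the resource block from configuration in the same change)
removed {
  from = azurerm_linux_virtual_machine.old_server

  lifecycle {
    destroy = false # forget the resource, leave the real VM running
  }
}
```

Because these run under the usual state lock and show up in the plan, they are harder to get wrong than a one-off command typed on a laptop.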
Implement proper CI/CD workflows
Don't let humans run terraform apply from their laptops. Use CI/CD pipelines with:
Automatic terraform plan on pull requests
Required approvals before apply
State locking handled automatically
Audit logs of who applied what
Separate state files per environment
# Example GitHub Actions workflow
name: Terraform
on:
pull_request:
paths: ['terraform/**']
jobs:
plan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: hashicorp/setup-terraform@v2
- run: terraform init
- run: terraform plan -out=tfplan
- uses: actions/upload-artifact@v3
with:
name: tfplan
path: tfplan
Split state into manageable pieces
Don't put everything in one giant state file. Break it up by:
Environment (dev, staging, prod)
Layer (networking, compute, data)
Team ownership (platform, security, apps)
terraform/
├── networking/
│   ├── backend.tf
│   └── main.tf
├── compute/
│   ├── backend.tf
│   └── main.tf
└── data/
    ├── backend.tf
    └── main.tf
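This limits blast radius and improves performance. Just be careful with cross-stack dependencies: every terraform_remote_state reference couples two stacks, and the upstream one has to be applied first.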
WRAPPING UP
Terraform state is a necessary evil. It's the price you pay for declarative infrastructure management. The good news is that the benefits—dependency tracking, collaboration, drift detection—outweigh the operational overhead once you've got proper workflows in place.
The bad news is that you'll definitely learn these lessons the hard way at least once. When you do, remember:
Remote state with versioning and locking is non-negotiable
Secrets in state files are a security risk—treat state like sensitive data
State corruption happens—have backups and know how to restore
Manual state operations are powerful and dangerous—measure twice, cut once
And if you ever find yourself running terraform destroy in production, maybe take a coffee break first and verify you're in the right directory. Your future self will thank you.

