Terraform state: The good, the bad, and the ugly

If you've worked with Terraform for more than a day, you've encountered the state file. Maybe you've cursed at it when it locked during a critical deployment. Maybe you've panicked when someone accidentally deleted it. Or maybe—just maybe—you've committed one with Azure credentials baked in and had a very awkward conversation with your security team.

Terraform state is simultaneously one of the most elegant solutions and one of the biggest operational headaches in infrastructure-as-code. Let's break down why it exists, where it hurts, and how it can spectacularly blow up in your face.

THE GOOD: WHY STATE EXISTS IN THE FIRST PLACE

Terraform's state file is basically a snapshot of your infrastructure at a given point in time. It maps your HCL configuration to real-world resources and stores metadata that Terraform needs to manage those resources effectively.

It solves the "what's actually deployed" problem

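Without state, Terraform would have no record of which real-world resources belong to which blocks in your config. It would have to rediscover them from your cloud provider on every run, usually by guessing from names or tags. That sounds workable until you're managing 500+ resources across multiple regions. State gives terraform plan a cached record to diff against, and on large setups you can lean on that cache with -refresh=false to skip the provider queries entirely.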

# Your config says this:
resource "azurerm_linux_virtual_machine" "web" {
  name                = "web-vm"
  resource_group_name = azurerm_resource_group.main.name
  location            = "East US"
  size                = "Standard_B1s"
  # ... other config
}

# State file knows this exists with its Azure resource ID
# and can detect when you change size or other attributes

It tracks resource dependencies

Terraform builds its dependency graph from your configuration and records each resource's dependencies in state. That record is how it still knows that your Azure SQL Database depends on your NSG, which depends on your VNet, even after you've deleted those blocks from the config. When you destroy resources, it tears them down in the correct order. Try doing that manually at 2 AM during an incident. I'll wait.

It enables collaboration (sort of)

When multiple engineers work on the same infrastructure, state acts as the source of truth. Combined with remote backends and locking, it prevents the classic "we both applied changes at the same time and now everything's broken" scenario.

The state file also stores output values, which other Terraform configurations can reference:

# In your networking module
output "subnet_id" {
  value = azurerm_subnet.main.id
}

# Another team's config can use this
data "terraform_remote_state" "network" {
  backend = "azurerm"
  config = {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "companytfstate"
    container_name       = "tfstate"
    key                  = "network.terraform.tfstate"
  }
}

resource "azurerm_network_interface" "app" {
  # ... other config
  ip_configuration {
    name                          = "internal"
    # Reference the subnet from remote state
    subnet_id                     = data.terraform_remote_state.network.outputs.subnet_id
    private_ip_address_allocation = "Dynamic"
  }
}

This is actually pretty slick when it works. The problem is all the ways it doesn't work.

THE BAD: COMMON PAIN POINTS THAT'LL MAKE YOU QUESTION YOUR LIFE CHOICES

State locking conflicts

You know that feeling when you run terraform apply and get hit with "Error acquiring the state lock"? That's state locking doing its job—preventing concurrent modifications. But when Jenkins crashes mid-apply or someone's laptop dies, that lock stays in place.

# The dreaded error
Error: Error acquiring the state lock

Error message: blob is already locked
Lock Info:
  ID:        a1b2c3d4-5678-90ef-ghij-klmnopqrstuv
  Path:      companytfstate/tfstate/prod.terraform.tfstate
  Operation: OperationTypeApply
  Who:       jenkins@ci-runner-42
  Created:   2026-02-04 14:23:17.123456789 +0000 UTC

Now you're stuck deciding: is that lock legitimate, or is it a zombie from a failed run? Force-unlock and you might corrupt state if someone's actually using it. Don't unlock and your deployment is blocked.

# The nuclear option
terraform force-unlock a1b2c3d4-5678-90ef-ghij-klmnopqrstuv

# Proceed with extreme caution

State drift: When reality diverges from code

Someone made a "quick fix" in the Azure Portal. Someone else manually scaled a VMSS. Now your state file thinks you have 3 instances, but you actually have 5. Terraform has no idea what happened.

# Refreshing shows the horror (terraform refresh is deprecated in favor of -refresh-only)
$ terraform apply -refresh-only

azurerm_linux_virtual_machine.web[3]: Refreshing state... [id=/subscriptions/.../manually-created-vm-1]
azurerm_linux_virtual_machine.web[4]: Refreshing state... [id=/subscriptions/.../manually-created-vm-2]

# Now what? Delete them? Keep them? Cry?

Drift detection is possible with terraform plan -refresh-only, but it only tells you about resources Terraform knows about. If someone created resources outside of Terraform, you're blind to them until they cause an outage.

Team collaboration nightmares

Even with remote state and locking, you'll hit these scenarios:

Scenario 1: The race condition

Alice runs terraform plan at 3:00 PM
Bob runs terraform apply at 3:05 PM
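Alice runs terraform apply at 3:10 PM from a checkout that lacks Bob's changes
Terraform rejects a saved plan file as stale, but a plain apply re-plans and can quietly revert Bob's work if nobody reads the diff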

Scenario 2: The branch problem

Feature branches create competing versions of state
Merging code is easy; merging state files is... not a thing
You end up with orphaned resources nobody remembers creating

Scenario 3: The "who changed what" mystery

State files don't have audit logs. Someone deleted your load balancer and you have no idea who or when. Good luck with that post-mortem.

Performance degradation at scale

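Once you hit thousands of resources, terraform plan slows to a crawl, because every resource in state gets refreshed against the provider API before Terraform can compute a diff. You'll start breaking up monolithic state files into smaller, separately managed ones just to keep things manageable: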

# This used to take 10 seconds, now takes 5 minutes
$ terraform plan
Refreshing Terraform state in-memory prior to plan...
azurerm_virtual_network.main: Refreshing state... [id=/subscriptions/.../virtualNetworks/main-vnet]
azurerm_subnet.public[0]: Refreshing state... [id=/subscriptions/.../subnets/public-subnet-0]
# ... 2,437 more resources to go

THE UGLY: SECURITY DISASTERS AND STATE CORRUPTION

This is where things get scary. State files are ticking time bombs if you don't handle them properly.

Sensitive data exposure

Here's the thing nobody tells you: Terraform state stores every attribute as plaintext JSON, including the ones marked sensitive. Passwords, API keys, database connection strings: all sitting there for anyone who can read the file.

{
  "version": 4,
  "terraform_version": "1.6.0",
  "resources": [
    {
      "type": "azurerm_mssql_server",
      "name": "main",
      "instances": [
        {
          "attributes": {
            "administrator_login": "sqladmin",
            "administrator_login_password": "SuperSecretPassword123!",
            "fully_qualified_domain_name": "mydb.database.windows.net"
          }
        }
      ]
    }
  ]
}

If you commit this to Git (don't laugh, it happens constantly), you've just handed attackers your database credentials. Even if you remove it in the next commit, it's in Git history forever.

If your Azure Storage Account permissions are misconfigured, anyone with read access can download your state blob and extract secrets. Hosted backends such as Terraform Cloud have the same problem in a different shape: anyone granted read access to a workspace's state can see every secret in it, so check who that actually includes.

State corruption and the recovery nightmare

State corruption usually happens when:

A Terraform run gets interrupted mid-apply
Two applies run simultaneously (locking failed)
Manual edits go wrong
The state backend has issues (Azure Blob Storage transient errors, network timeouts)

When state corrupts, or simply gets written by a newer Terraform than the one you're running, you're in for a bad time:

$ terraform plan
Error: Failed to load state: state snapshot was created by Terraform v1.6.0,
which is newer than current v1.5.0; upgrade to Terraform v1.6.0 or greater to work with this state

# Or worse...
Error: Failed to decode state: invalid character 'x' looking for beginning of value

Now you're trying to restore from backups (you have backups, right?), manually editing JSON to fix corruption, or using terraform import to rebuild state from scratch. For every. Single. Resource.

# The tedious recovery process
$ terraform import azurerm_linux_virtual_machine.web /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/web-vm
$ terraform import azurerm_network_security_group.web /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Network/networkSecurityGroups/web-nsg
# ... repeat 500 more times
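
If you're on Terraform 1.5 or later, you can at least make the rebuild reviewable by declaring imports in config instead of typing them one at a time. A minimal sketch, reusing the two resource IDs from the commands above (the all-zero subscription ID and mygroup are placeholders, not real values):

```hcl
# imports.tf
# Requires Terraform 1.5 or later. Each block adopts one existing Azure
# resource into state on the next apply. Assumes the azurerm provider is
# already configured elsewhere in this module.

import {
  to = azurerm_linux_virtual_machine.web
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/web-vm"
}

import {
  to = azurerm_network_security_group.web
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Network/networkSecurityGroups/web-nsg"
}
```

Running terraform plan shows what will be imported before anything touches state. If the matching resource blocks are missing, terraform plan -generate-config-out=generated.tf asks Terraform to draft them. Treat the generated file as a starting point that needs review, not as finished config.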

The "someone deleted production" disaster

The ultimate nightmare: someone runs terraform destroy on production state by accident. Or they delete the Azure Storage Account container holding state. Or they push broken state that marks everything for deletion.

# This should require two-factor authentication and a blood oath
$ terraform destroy
# ...
Destroy complete! Resources: 437 destroyed.

# Congratulations, you've just deleted production

Without backups and versioning enabled on your state backend, recovery is somewhere between "extremely difficult" and "update your resume."
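
Two cheap guardrails are worth sketching here. The names prod-rg and prod-cannot-delete are made up for illustration; the prevent_destroy lifecycle argument and the azurerm_management_lock resource are the real pieces.

```hcl
# Assumes the azurerm provider is already configured elsewhere in this module.

resource "azurerm_resource_group" "prod" {
  name     = "prod-rg" # hypothetical name
  location = "East US"

  lifecycle {
    # Any plan that would destroy this resource fails with an error.
    prevent_destroy = true
  }
}

resource "azurerm_management_lock" "prod" {
  name       = "prod-cannot-delete" # hypothetical name
  scope      = azurerm_resource_group.prod.id
  lock_level = "CanNotDelete"
  notes      = "Production resource group. Remove this lock only with approval."
}
```

With prevent_destroy set, terraform destroy stops at the plan stage instead of deleting the resource group. The management lock covers deletions made outside Terraform, through the portal or the CLI. Neither helps if someone removes the guardrail from the config first, so keep code review in front of it.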

BEST PRACTICES: MAKING PEACE WITH STATE

Alright, enough doom and gloom. Here's how to avoid these disasters.

Always use remote backends

Never, ever keep state files local. Use Azure Storage, Terraform Cloud, or another remote backend with versioning and encryption.

terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "companytfstate"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"

    # State locking is built-in with blob leases
    # No separate lock table needed like DynamoDB
  }
}

Make sure your Azure Storage Account has:

Versioning enabled (recover from accidental deletes/corruption)
Encryption at rest (enabled by default with Microsoft-managed keys or use CMK)
Strict RBAC policies (principle of least privilege)
Soft delete enabled (provides undelete capability)
Diagnostic logging enabled (audit trail)
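
Most of that checklist can itself be written as Terraform. This is a sketch rather than a hardened baseline: it assumes azurerm provider 3.x argument names, reuses the resource group, storage account, and container names from the backend block above, and picks GRS replication and a 30-day retention window arbitrarily. RBAC assignments and diagnostic settings are left out.

```hcl
# Assumes the azurerm provider (3.x argument names) is configured elsewhere.

resource "azurerm_resource_group" "state" {
  name     = "terraform-state-rg"
  location = "East US"
}

resource "azurerm_storage_account" "state" {
  name                     = "companytfstate"
  resource_group_name      = azurerm_resource_group.state.name
  location                 = azurerm_resource_group.state.location
  account_tier             = "Standard"
  account_replication_type = "GRS" # arbitrary choice for this sketch
  min_tls_version          = "TLS1_2"

  # Never allow anonymous access to state blobs.
  allow_nested_items_to_be_public = false

  blob_properties {
    # Keep prior versions of the state blob for recovery.
    versioning_enabled = true

    # Soft delete for blobs and containers (30 days is an assumption).
    delete_retention_policy {
      days = 30
    }
    container_delete_retention_policy {
      days = 30
    }
  }
}

resource "azurerm_storage_container" "state" {
  name                  = "tfstate"
  storage_account_name  = azurerm_storage_account.state.name
  container_access_type = "private"
}
```

Keep this in a small bootstrap configuration of its own, since the backend has to exist before any other stack can store state in it.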

State locking with Azure blob leases

State locking prevents concurrent modifications. The good news? Azure Storage handles this automatically using blob leases—no separate infrastructure required.

When Terraform acquires a lock, it creates a lease on the state blob. If another process tries to modify the same state, it fails immediately:

# Azure handles locking automatically via blob leases
# No additional resources to provision
# The lease has no expiry, so a crashed run leaves it held until it is released or broken

You can verify locking behavior by checking blob properties in the Azure Portal or via CLI:

az storage blob show \
  --account-name companytfstate \
  --container-name tfstate \
  --name prod.terraform.tfstate \
  --query "properties.lease" -o table

The lease status will show "locked" when Terraform is actively modifying state.

Never commit secrets to Terraform config

Use secret management tools and reference them dynamically:

# BAD: Hardcoded secrets
resource "azurerm_mssql_server" "main" {
  administrator_login_password = "SuperSecretPassword123!"  # Don't do this
}

# GOOD: Reference from Azure Key Vault
data "azurerm_key_vault" "main" {
  name                = "company-keyvault"
  resource_group_name = "shared-services-rg"
}

data "azurerm_key_vault_secret" "db_password" {
  name         = "sql-admin-password"
  key_vault_id = data.azurerm_key_vault.main.id
}

resource "azurerm_mssql_server" "main" {
  administrator_login_password = data.azurerm_key_vault_secret.db_password.value
}

The secret still ends up in state, but at least it's not in your Git repo. For extra paranoia, use customer-managed keys (CMK) in Azure Key Vault to encrypt state and configure RBAC to restrict access.
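
It also helps to mark anything secret as sensitive so it doesn't leak through plan output and CI logs. A small sketch, with a hypothetical variable name:

```hcl
# sql_admin_password is a hypothetical name used only for illustration.
variable "sql_admin_password" {
  description = "Administrator password for the SQL server"
  type        = string
  sensitive   = true # redacted in plan and apply output
}

# Any output derived from a sensitive value must be marked sensitive too,
# otherwise Terraform refuses to plan.
output "sql_admin_password" {
  value     = var.sql_admin_password
  sensitive = true
}
```

This only redacts the value in CLI output. The value is still written to state in plaintext, so locking down the backend remains the real control.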

Master state manipulation commands

Sometimes you need to manually fix state. Learn these commands:

# Import existing resources into state
terraform import azurerm_linux_virtual_machine.example /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/example-vm

# Remove a resource from state (without deleting the actual resource)
terraform state rm azurerm_linux_virtual_machine.old_server

# Move a resource to a different address
terraform state mv azurerm_linux_virtual_machine.old azurerm_linux_virtual_machine.new

# List all resources in state
terraform state list

# Show details about a specific resource
terraform state show azurerm_linux_virtual_machine.web

Use these carefully—they directly modify state. Always back up state before manual operations:

terraform state pull > backup.tfstate
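
Newer Terraform versions let you express the two riskiest operations as config, which means they go through plan and code review like everything else. A sketch using the same addresses as the commands above, assuming Terraform 1.1 or later for moved and 1.7 or later for removed:

```hcl
# Equivalent of `terraform state mv`: the address changes, the VM is untouched.
moved {
  from = azurerm_linux_virtual_machine.old
  to   = azurerm_linux_virtual_machine.new
}

# Equivalent of `terraform state rm`: Terraform forgets the VM without
# destroying it.
removed {
  from = azurerm_linux_virtual_machine.old_server

  lifecycle {
    destroy = false
  }
}
```

The moved block needs a resource block at the new address, and the removed block needs the old resource block deleted from the config. Both show up in terraform plan before anything is written to state.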

Implement proper CI/CD workflows

Don't let humans run terraform apply from their laptops. Use CI/CD pipelines with:

Automatic terraform plan on pull requests
Required approvals before apply
State locking handled automatically
Audit logs of who applied what
Separate state files per environment

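# Example GitHub Actions workflow
name: Terraform
on:
  pull_request:
    paths: ['terraform/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        # Run every step from the directory the path filter watches
        working-directory: terraform
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform plan -input=false -out=tfplan
      # The plan file can contain secrets; restrict who can download artifacts
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: terraform/tfplan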

Split state into manageable pieces

Don't put everything in one giant state file. Break it up by:

Environment (dev, staging, prod)
Layer (networking, compute, data)
Team ownership (platform, security, apps)

terraform/
├── networking/
│   ├── backend.tf
│   └── main.tf
├── compute/
│   ├── backend.tf
│   └── main.tf
└── data/
    ├── backend.tf
    └── main.tf

This limits blast radius and improves performance. Just be careful with cross-stack dependencies.
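
One way to loosen that coupling, sketched here with hypothetical names (app-subnet, main-vnet, network-rg), is to look the resource up with a provider data source instead of reading another stack's state:

```hcl
# In the compute stack. Assumes the azurerm provider is configured elsewhere.
# All three names below are hypothetical and must match what the networking
# stack actually created.
data "azurerm_subnet" "app" {
  name                 = "app-subnet"
  virtual_network_name = "main-vnet"
  resource_group_name  = "network-rg"
}

# Use the looked-up ID wherever the stack needs it.
output "app_subnet_id" {
  value = data.azurerm_subnet.app.id
}
```

The consuming stack then only needs read access to the subnet in Azure, not to the networking stack's whole state file and whatever secrets it holds. The trade-off is that those names become an implicit contract between teams.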

WRAPPING UP

Terraform state is a necessary evil. It's the price you pay for declarative infrastructure management. The good news is that the benefits—dependency tracking, collaboration, drift detection—outweigh the operational overhead once you've got proper workflows in place.

The bad news is that you'll definitely learn these lessons the hard way at least once. When you do, remember:

Remote state with versioning and locking is non-negotiable
Secrets in state files are a security risk—treat state like sensitive data
State corruption happens—have backups and know how to restore
Manual state operations are powerful and dangerous—measure twice, cut once

And if you ever find yourself running terraform destroy in production, maybe take a coffee break first and verify you're in the right directory. Your future self will thank you.
